ABSTRACT


Pattern recognition is considered generally, but emphasis is placed on
the following points: optimum estimation of statistical parameters so as to
minimize the probability of incorrect classification, non-Gaussian and
non-stationary situations, pattern detection in a continuing time series, and
calculation of error probabilities. Some of the work is specifically directed
toward the problem of radio station recognition. A design procedure for pattern
recognizing machines is suggested which uses results from this report and other
referenced sources.

1. Introduction

The  objective  of  the  present  contract  is  to  study  the  applicability  of  machine 
learning  processes  to  military  detection  and  recognition  problems.  Specifically, 
it  is  desired  to  investigate  the  concept  and  design  of  a  pattern  recognizing  machine 
which  accepts  incomplete  data  subject  to  measurement  errors,  which  maintains  an 
up-to-date  catalog  of  properties  of  previously  identified  objects,  and  which  displays 
the  recommended  classification  as  to  the  identity  of  unknown  objects  together  with  the 
probability  that  the  decision  is  correct.  The  machine  should  also  determine  if  a  new 
pattern  fits  any  of  the  categories  already  observed,  or  if  it  belongs  to  a  new  category 
previously unobserved. These objectives have been paraphrased from the "Statement
of Work".

Most  of  the  work  which  has  been  done  on  this  contract  has  centered  on  a  specific 
example  of  pattern  recognition,  the  identification  of  radio  stations  from  their  carrier 
fading  curves.  This  example  was  chosen  because  it  allows  samples  to  be  taken  quite 
easily  in  the  laboratory,  and  because  it  is  closely  related  to  certain  practical  problems 
of interest to the U. S. Air Force. The present contract continued the work of a previous
contract, AF-19(604)-6154, which is reported in References 13, 17, 18, 19 and 20. The
present final report is supplemented by three scientific reports, References 21, 22 and 23.

2.1. Pattern recognition - general

It has been convenient for most workers in this field to divide a pattern
recognizing machine into two parts: the first part extracts certain numerical
properties from the pattern; the second part, given the set of properties,
decides to which class the pattern belongs. The present section and most of this
report are concerned with the second of these two parts. The basic process can be
described in non-mathematical terms, at least in its simpler aspects. Suppose that
four samples of patterns are available from each of two classes, A and B. If two
properties of each pattern are measured, the samples might fall as shown in Fig. 1.


Now if a property vector X is observed to fall as shown, most people would not
hesitate to say that it belongs to class B. The diagram also shows that the second
property alone is sufficient to identify X as belonging to class B, but that the first
property used alone would probably misclassify X as belonging to class A. The
projection of the points on the first property axis shows an overlap of class A and B
samples, whereas the projection on the second property axis shows two distinct
clusters.
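The remark about projections can be checked numerically. The sketch below uses made-up coordinates (the actual points of Fig. 1 are not given in the text) and hypothetical helper names `ranges_overlap` and `axis_overlap`. For each property axis it tests whether the intervals spanned by the class A and class B samples intersect.

```python
# Hypothetical sample points (property 1, property 2); NOT the points of Fig. 1.
CLASS_A = [(1.0, 1.0), (2.0, 1.5), (3.0, 0.8), (4.0, 1.2)]
CLASS_B = [(2.5, 3.0), (3.5, 3.4), (4.5, 2.9), (5.5, 3.2)]


def ranges_overlap(a_values, b_values):
    """Return True if the intervals [min, max] of the two samples intersect."""
    return min(a_values) <= max(b_values) and min(b_values) <= max(a_values)


def axis_overlap(class_a, class_b, axis):
    """Project both classes onto one property axis and test for overlap."""
    a_proj = [point[axis] for point in class_a]
    b_proj = [point[axis] for point in class_b]
    return ranges_overlap(a_proj, b_proj)


overlap_first = axis_overlap(CLASS_A, CLASS_B, axis=0)   # property 1
overlap_second = axis_overlap(CLASS_A, CLASS_B, axis=1)  # property 2
print("property 1 overlaps:", overlap_first)
print("property 2 overlaps:", overlap_second)
```

With these illustrative points the first property overlaps between the classes while the second separates them, which is the situation described for X.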

Some general ideas can be obtained from the above example. If the sample
points fall in definite clusters which are well separated from each other, and if the
unknown falls near one of these clusters, the classification is easy. If such clustering
does not take place, the addition of more properties might then cause it. It is very
difficult to visualize such processes in a space of many dimensions, and intuition is
not always sufficient to design an optimum machine, especially where the clusters
overlap slightly. The following sections are devoted to a mathematical theory of
this phase of pattern recognition which is derived from classical statistics and from
communication and radar engineering. Before going on to statistical methods, it
should be mentioned that some methods for pattern recognition are essentially
non-statistical (References 9, 27, 28), but these can usually be considered within the
statistical framework as special cases.


2.2.  Basic  model  for  pattern  recognition 

A mathematical framework which is suited to many types of decision making
in property space will first be developed generally; specialized applications will
then be made and the connections with other theories pointed out. The pattern recognizers
considered here will be designed from the outset to minimize the probability of
misclassification. Although it may become necessary to compromise the optimum
design for various practical reasons, the relation of the usual methods to the ideal
will thus be more clearly seen.


The basic difficulty in pattern recognition stems from the fact that the patterns
are generated by a stochastic process, not a determinate process. That is, every
time the pattern generator sets out to form a pattern of a certain class the result
differs slightly from all previous attempts. This fact is described in the mathematical
theory by making the property vector $x$ a random variable governed by the probability
density $p(x \mid \alpha_j)$, where $\alpha_j$ is a vector of statistical parameters (means,
variances, covariances, etc.) pertaining to class $j$. In almost all pattern recognition
situations the true values of the $\alpha_j$ are unknown, and in fact even the functional
form of $p(x \mid \alpha_j)$ is unknown. In order to have a tractable mathematical model, it
will be assumed here that the form of $p(x \mid \alpha_j)$ is known. This assumption is not a
significant restriction on the generality of the theory since, if enough parameters are
used, $p(x \mid \alpha_j)$ can be made to approximate an arbitrary function to within given
tolerances. For example, if $p(x \mid \alpha_j)$ is assumed to be a Gaussian density and if $x$
has $r$ components, then $\alpha_j$ will have $r(r+3)/2$ components ($r$ means and $r(r+1)/2$
variances and covariances), and the best choice of these may not match
the actual density of $x$. However, if the actual density is found to be non-Gaussian,
then the number of parameters can be increased to include not only means and
variances but also skewness, excess, etc. In fact, the components of $\alpha_j$ might even
be made to be semi-invariants, coefficients in a Gram-Charlier or Edgeworth
expansion, or in fact coefficients in any series of orthogonal functions. As will be
seen below, there is a definite disadvantage in most practical applications in making
the functional form of $p(x \mid \alpha_j)$ too general, i.e., in allowing too many components in
the vector $\alpha_j$.

Most pattern recognition schemes require an estimation of the parameters
$\alpha_j$ from the given samples of the property vectors. In the mathematical model
proposed here, the $\alpha_j$ are themselves considered to be random variables and their
values need not be estimated explicitly. A choice of the values of the $\alpha_j$ is assumed
to be made by nature before any calibration or pattern recognition takes place, and these
values are assumed to then remain fixed throughout any single experiment*. The
stochastic process which generates the $\alpha_j$ will be described by the density
$\Lambda(\alpha_1, \alpha_2, \ldots, \alpha_k)$, where there are $k$ classes, and this density can be used to
introduce any a-priori knowledge the pattern recognizer may have of the $\alpha_j$ before the
calibration phase begins. Such a-priori knowledge may exist because of a physical
analysis of the pattern generation mechanism, from previous attempts at pattern
recognition in a time-varying environment, or for any other reason. As a special
case, $\Lambda(\alpha)$ can be considered as a flat density over a very large interval of $\alpha$,
indicating no a-priori knowledge. Since a-priori knowledge of the values of the $\alpha_j$
is usually insufficient, additional knowledge must be supplied through the measurement
of a certain number of samples of the property vector whose class is given. The
samples of class $j$ will be denoted by $\xi_j$ (doubly underlined in the original notation
to indicate that several values of each component of the property vector are given),
i.e., $\xi$ is a function of three indices: property, sample number, and class. The $t$-th
sample vector of class $i$ is written $\xi_i^{(t)}$, with $t = 1, \ldots, d$. If the sample vector
selections are independent for all classes and times, the joint density for $\xi$ is

$$\prod_{t=1}^{d} \prod_{i=1}^{k} p\left(\xi_i^{(t)} \mid \alpha_i\right)$$

The usual maximum likelihood estimates of the $\alpha_i$ would be those values which
maximize the above expression. Since there is no assurance that these values minimize
the misclassification probability, they will not be used.

*This condition is qualified somewhat in Section 2.5 below.

A complete pattern recognition experiment will consist of four stochastic
selections (some of which are multiple): selection of the statistical parameters
by "nature", selection of the calibration samples of the property vectors, selection
of the class of the unknown in the actual recognition phase, and selection of the
property vector. The first and third of these can be looked on as causes of the second
and fourth, which might then be called effects. This situation can then be analyzed
in terms of Bayes' rule as in Appendix 1. The effects are known to the pattern
recognizer; the causes are unknown. If the a-priori probability of class $j$ is $p_j$,
then the joint probability of causes $\alpha$, $j$ and effects $\xi$, $x$ is

$$P(c, e) = p_j \, \Lambda(\alpha_1, \ldots, \alpha_k) \prod_{t=1}^{d} \prod_{i=1}^{k} p\left(\xi_i^{(t)} \mid \alpha_i\right) \, p(x \mid \alpha_j) \tag{1}$$

Since the causes $\alpha_i$ are not really desired, the above expression should be integrated
over all values of the $\alpha_i$. The joint probability density of desired cause and observed
effects is then

$$P(j, e) = p_j \int \cdots \int \Lambda(\alpha_1, \ldots, \alpha_k) \prod_{t=1}^{d} \prod_{i=1}^{k} p\left(\xi_i^{(t)} \mid \alpha_i\right) \, p(x \mid \alpha_j) \, d\alpha_1 \cdots d\alpha_k \tag{2}$$


This can be simplified if the $\alpha_i$ are assumed to be statistically independent,
so that $\Lambda(\alpha_1, \ldots, \alpha_k) = \prod_{i=1}^{k} \Lambda_i(\alpha_i)$:

$$P(j, e) = p_j \int p(x \mid \alpha) \prod_{t=1}^{d} p\left(\xi_j^{(t)} \mid \alpha\right) \Lambda_j(\alpha) \, d\alpha \; \prod_{i \ne j}^{k} \int \prod_{t=1}^{d} p\left(\xi_i^{(t)} \mid \alpha\right) \Lambda_i(\alpha) \, d\alpha \tag{3}$$

The product with the $i = j$ term missing can be put in more convenient form by
using

$$\prod_{i \ne j}^{k} b_i = \frac{\prod_{i=1}^{k} b_i}{b_j}$$

the result being

$$P(j, e) = p_j \, \frac{\int p(x \mid \alpha) \prod_{t=1}^{d} p\left(\xi_j^{(t)} \mid \alpha\right) \Lambda_j(\alpha) \, d\alpha}{\int \prod_{t=1}^{d} p\left(\xi_j^{(t)} \mid \alpha\right) \Lambda_j(\alpha) \, d\alpha} \; G(\xi) \tag{4}$$

where

$$G(\xi) = \prod_{i=1}^{k} \int \prod_{t=1}^{d} p\left(\xi_i^{(t)} \mid \alpha\right) \Lambda_i(\alpha) \, d\alpha$$

Note that $G(\xi)$ is the joint density of all calibration data. If $P$ is divided by the
joint density of calibration data and property vector (of the unknown pattern), then the
conditional probability of the cause of the unknown given the calibration data and
unknown property vector will be obtained. The calibration data ($\xi$) and unknown
property vector ($x$) are statistically independent if the parameters $\alpha$ are known,
but in the present model they are not. The joint density of $\xi$ and $x$ is given by
integrating Eq. (1) over all values of $\alpha_1, \ldots, \alpha_k$ and summing for $j = 1, 2, \ldots, k$;
this is the same as summing Eq. (2) for $j = 1, 2, \ldots, k$, which in turn is the same as
summing Eq. (4). Let the ratio in Eq. (4) be denoted by $\mu(x \mid j, \xi)$, and the joint
probability of $\xi$ and $x$ by $q(x, \xi)$; then the conditional probability of the unknown
class being $j$ given the calibration data $\xi$ and the unknown property vector $x$ is

$$q(j \mid x, \xi) = \frac{p_j \, \mu(x \mid j, \xi) \, G(\xi)}{q(x, \xi)}$$

where $q(x, \xi) = \sum_{h=1}^{k} p_h \, \mu(x \mid h, \xi) \, G(\xi)$.

The factor $G$ drops out, giving

$$q(j \mid x, \xi) = \frac{p_j \, \mu(x \mid j, \xi)}{q(x \mid \xi)} \tag{6}$$

where $q(x \mid \xi) = \sum_{h=1}^{k} p_h \, \mu(x \mid h, \xi)$ and $\mu(x \mid j, \xi)$ is the ratio in Eq. (4).

Reference to Appendix 1 (with $j$, $\alpha$, $x$, $\xi$ equivalent to $c_1$, $c_2$, $e_1$, $e_2$ respectively)
shows that the functions appearing in Eq. (6) have the following interpretations:
$q(x \mid \xi)$ is the conditional probability density of the unknown property vector $x$ given
the calibration data $\xi$, and $\mu(x \mid j, \xi)$ is the conditional probability density of
the unknown property vector $x$ given its class and given the calibration data $\xi$.
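The normalization step that turns class scores into conditional class probabilities can be sketched in a few lines. The function name `class_posteriors` and all numerical values below are illustrative assumptions, not taken from the report; the class-conditional densities are simply supplied as numbers already evaluated at the observed property vector.

```python
def class_posteriors(priors, class_densities):
    """Combine a-priori class probabilities p_j with class-conditional
    densities mu(x | j, calibration data) into q(j | x, calibration data)."""
    if len(priors) != len(class_densities):
        raise ValueError("need one density value per class")
    weighted = [p * mu for p, mu in zip(priors, class_densities)]
    total = sum(weighted)  # q(x | calibration data), the normalizing factor
    if total <= 0.0:
        raise ValueError("all class scores are zero")
    return [w / total for w in weighted]


# Hypothetical numbers for three classes.
priors = [0.5, 0.3, 0.2]
densities = [0.10, 0.40, 0.05]   # mu(x | j, xi) evaluated at the observed x
posteriors = class_posteriors(priors, densities)

# The recommended classification is the class with the largest posterior.
best_class = max(range(len(posteriors)), key=posteriors.__getitem__)
print(posteriors, best_class)
```

Because the normalizing factor is the same for every class, the class chosen is the one with the largest product of a-priori probability and class-conditional density; the division only matters when the probability of a correct decision is to be displayed.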

The connection between the pattern recognizer which results from the above
model and the more usual forms may be seen quite easily. The identification of
class $j$ made by the above model is that value of $j$ which maximizes Eq. (4), or,
if a factor not dependent on $j$ is dropped, which maximizes

$$A_j(x; \xi) = p_j \, \frac{\int p(x \mid \alpha) \prod_{t=1}^{d} p\left(\xi_j^{(t)} \mid \alpha\right) \Lambda_j(\alpha) \, d\alpha}{\int \prod_{t=1}^{d} p\left(\xi_j^{(t)} \mid \alpha\right) \Lambda_j(\alpha) \, d\alpha} \tag{8}$$

The usual maximum likelihood estimate of $\alpha_j$ is the value which maximizes

$$L\left(\xi_j^{(1)}, \xi_j^{(2)}, \ldots, \xi_j^{(d)}; \alpha_j\right) = \prod_{t=1}^{d} p\left(\xi_j^{(t)} \mid \alpha_j\right)$$

See Cramér, Chapters 32, 33 and especially Section 33.2. Denote the maximum
likelihood estimate of $\alpha$ for class $j$ by $\alpha_j^*$. If the sample size is large, i.e., if
$d \gg 1$, then $L$ is a sharply peaked function of $\alpha$; that is, $L$ as a function of $\alpha$ will
approximate a Dirac $\delta$ function, $\delta(\alpha - \alpha_j^*)$. Under these circumstances, since
$\int \delta(\alpha - \alpha^*) f(\alpha) \, d\alpha = f(\alpha^*)$, Eq. (8) reduces to

$$A_j(x; \xi) = p_j \, \frac{p(x \mid \alpha_j^*) \, \Lambda_j(\alpha_j^*)}{\Lambda_j(\alpha_j^*)} = p_j \, p(x \mid \alpha_j^*) \tag{10}$$

If $\alpha_j^*$ happens to be the true value of $\alpha_j$, then the value of $j$ which maximizes this
expression represents the best possible choice, but if $\alpha_j^*$ is a maximum likelihood
estimate from a finite sample there is no reason to believe that maximizing
$p_j \, p(x \mid \alpha_j^*)$ will be equivalent to maximizing $A_j$ (Eq. 8). However, it was shown above
that maximizing Eq. (8) maximizes the probability of a correct choice of the unknown
class $j$, and it is shown in Appendix 1 that maximizing Eq. (8) will minimize the
probability of misclassification in a series of trials of the pattern recognizer. On the
latter point, see also Section 3.2 below. It has been stated in at least one place
in the literature that a maximum likelihood estimate of $\alpha$ will result in a maximum
likelihood estimate of $j$; but this is not so since, although $j$ is a single-valued function
of $\alpha$, $\alpha$ is certainly not a single-valued function of $j$. In fact, even the coordinates of
the class regions in $x$ space do not uniquely determine the set of $\alpha$'s.

A more common statement seems to be that it is convenient to first get maximum
likelihood estimates $\alpha_j^*$ and then to use them in Eq. (10) (see for example Rao,
p. 289, or Anderson, p. 137). Note that if the true values of the $\alpha_j$ are known to be
$\alpha_j^0$, then $\Lambda_j(\alpha_j) = \delta(\alpha_j - \alpha_j^0)$ and Eq. (8) becomes $p_j \, p(x \mid \alpha_j^0)$.

A more symmetrical form for Eq. (8) can be obtained as follows:

$$A_j(x; \xi) = p_j \, \frac{\int p(x \mid \alpha) \, Z_j(\alpha) \, \Lambda_j(\alpha) \, d\alpha}{\int Z_j(\alpha) \, \Lambda_j(\alpha) \, d\alpha}$$

where $Z_j(\alpha) = \prod_{t=1}^{d} p\left(\xi_j^{(t)} \mid \alpha\right)$.

Here $\Lambda_j$ represents the knowledge of $\alpha$ before the calibration, $Z_j$ represents the
additional knowledge gained about $\alpha$ during calibration, and the whole formula
represents the best way to combine the a-priori knowledge and the a-posteriori knowledge
in making the final recognition. If the calibration sample is very small, $\Lambda_j$ will play
a more important role.
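The ratio of integrals can be approximated numerically when the parameter is a single number. The sketch below is only an illustration under stated assumptions: a binary property with one parameter (the probability that the property equals 1), a flat a-priori density on the interval from 0 to 1, a midpoint-rule sum in place of the integrals, and hypothetical names `predictive_density`, `bernoulli` and `flat_prior`. It compares the averaged result with the value obtained by inserting the maximum likelihood estimate.

```python
def predictive_density(x, samples, likelihood, prior, grid):
    """Approximate  int p(x|a) Z(a) Lambda(a) da / int Z(a) Lambda(a) da
    by a sum over equally spaced parameter values in `grid`."""
    numerator = 0.0
    denominator = 0.0
    for a in grid:
        z = 1.0
        for s in samples:          # Z(a): likelihood of the calibration sample
            z *= likelihood(s, a)
        weight = z * prior(a)
        numerator += likelihood(x, a) * weight
        denominator += weight
    return numerator / denominator


def bernoulli(x, a):
    """p(x | a) for a binary property: x = 1 with probability a."""
    return a if x == 1 else 1.0 - a


def flat_prior(a):
    """No a-priori knowledge: constant density over the whole interval."""
    return 1.0


N = 2000
GRID = [(i + 0.5) / N for i in range(N)]   # midpoints of [0, 1]

calibration = [1, 1, 1, 0]                 # hypothetical sample: 3 ones, 1 zero
averaged = predictive_density(1, calibration, bernoulli, flat_prior, GRID)
plug_in = sum(calibration) / len(calibration)   # maximum likelihood estimate
print(averaged, plug_in)
```

For this small sample the averaged value is close to 2/3 while the plug-in value is 3/4; the two approach each other as the calibration sample grows, in line with the large-sample argument leading to Eq. (10).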

Some people seem to object to considering statistical parameters such as the
$\alpha$'s as random variables. If the objection is philosophical, then it can be answered by
enlarging the physical process to include parameter selection. Many statistical
statements are made about runs of heads in coin-tossing where the probability of
a head is $p$, but $p$ itself could be a random variable if the coin to be tossed is first
selected by random choice from a box of coins with various biases. If the objection
is on the more practical grounds that the a-priori probability density of the parameters
$\Lambda(\alpha)$ is not known in practice, then it can be answered that in many cases it is known
(from previous experiments, physical analysis, etc.), but in any event one can always
set $\Lambda(\alpha)$ equal to a constant over a wide enough range of $\alpha$ to cause it to drop out of
Eq. (8) if no initial knowledge of the parameters is available. A Leica can always be
set to behave like a box camera if desired. It might be pointed out that the method of
pattern recognition leading to Eq. (8) differs from the usual one in two ways: averaging
over all parameter values instead of using one estimated value, and including an
a-priori probability density for the parameters. Note that the second feature could be
obtained without the first (as well as the other way around) by modifying the classical
method of parameter estimation to maximize

$$\Lambda_j(\alpha_j) \prod_{t=1}^{d} p\left(\xi_j^{(t)} \mid \alpha_j\right)$$

The argument over whether to include $\Lambda(\alpha)$ or not here is essentially the same as the
fruitless arguments over the validity of Bayes' rule. Several sources of semantic
confusion also exist: the term random "variable" does not imply something which
varies; the parameter has an expected value if it is a random variable, and hence
the "mean of the mean" exists; etc.

2.3. Uncorrelated Gaussian model


In order to provide a definite example of the model given above, consider the case
where the property vector has $r$ components which are uncorrelated and normally
distributed:

$$p(x \mid m, \sigma) = \frac{1}{(2\pi)^{r/2} \prod_{i=1}^{r} \sigma_{ji}} \exp\left[-\frac{1}{2} \sum_{i=1}^{r} \frac{(x_i - m_{ji})^2}{\sigma_{ji}^2}\right] \tag{12}$$

Here there are $2r$ statistical parameters ($\alpha$) consisting of means ($m$) and standard
deviations ($\sigma$), related as follows for class $j$:

$$\alpha_{j,1} = m_{j1}, \; \ldots, \; \alpha_{j,r} = m_{jr}; \qquad \alpha_{j,r+1} = \sigma_{j1}, \; \ldots, \; \alpha_{j,2r} = \sigma_{jr}; \qquad j = 1, 2, \ldots, k$$

The functions $\Lambda_j(\alpha)$ will be taken as constants over the region of interest, i.e., no
a-priori information about the parameters is assumed. The right side of Eq. (8) can
now be written as a product of similar ratios, one for each property:


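$$A_j = p_j \prod_{i=1}^{r} \frac{\int_{0}^{\infty} \int_{-\infty}^{\infty} p(x_i \mid m, \sigma) \prod_{t=1}^{d} p\left(\xi_{ji}^{(t)} \mid m, \sigma\right) dm \, d\sigma}{\int_{0}^{\infty} \int_{-\infty}^{\infty} \prod_{t=1}^{d} p\left(\xi_{ji}^{(t)} \mid m, \sigma\right) dm \, d\sigma} \tag{13}$$

where $p(x \mid m, \sigma)$ is a single variable Gaussian density function and $\xi_{ji}^{(t)}$ is the
$i$-th property of the $t$-th calibration sample of class $j$.
The integrations are easier if the substitution $\beta = 1/\sigma$ is made:

$$A_j = p_j \prod_{i=1}^{r} \frac{1}{\sqrt{2\pi}} \, \frac{\int_{0}^{\infty} \int_{-\infty}^{\infty} \beta^{d-1} \exp\left\{-\frac{\beta^2}{2}\left[(d+1)\, m^2 - 2m (L_{ji} + x_i) + S_{ji} + x_i^2\right]\right\} dm \, d\beta}{\int_{0}^{\infty} \int_{-\infty}^{\infty} \beta^{d-2} \exp\left\{-\frac{\beta^2}{2}\left[d\, m^2 - 2m L_{ji} + S_{ji}\right]\right\} dm \, d\beta} \tag{14}$$

where

$$L_{ji} = \sum_{t=1}^{d} \xi_{ji}^{(t)}, \qquad S_{ji} = \sum_{t=1}^{d} \left(\xi_{ji}^{(t)}\right)^2$$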


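The integration with respect to $m$ can be carried out by completing the square and
noting that the result is simply the integral of a normal density over all values, i.e.,
unity. The integration with respect to $\beta$ can then be carried out by noting that

$$\int_{0}^{\infty} \beta^{n} e^{-a \beta^2} \, d\beta = \frac{\Gamma\left(\frac{n+1}{2}\right)}{2\, a^{(n+1)/2}}, \qquad n > -1, \; a > 0$$

where $\Gamma$ is the Gamma function. The result is

$$A_j = p_j \prod_{i=1}^{r} \sqrt{\frac{d}{\pi (d+1)}} \; \frac{\Gamma\left(\frac{d-1}{2}\right)}{\Gamma\left(\frac{d-2}{2}\right)} \; \frac{\left[S_{ji} - \frac{L_{ji}^2}{d}\right]^{(d-2)/2}}{\left[S_{ji} + x_i^2 - \frac{(L_{ji} + x_i)^2}{d+1}\right]^{(d-1)/2}} \tag{15}$$

for $d > 2$.

This result is more easily visualized if expressed in terms of the sample mean $M$
and sample standard deviation $D$ defined as follows:

$$M_{ji} = \frac{1}{d} \sum_{t=1}^{d} \xi_{ji}^{(t)} = \frac{L_{ji}}{d}, \qquad D_{ji}^2 = \frac{1}{d} \sum_{t=1}^{d} \left(\xi_{ji}^{(t)} - M_{ji}\right)^2 = \frac{S_{ji}}{d} - M_{ji}^2 \tag{16}$$

The unknown belongs to the class which maximizes

$$p_j \prod_{i=1}^{r} \frac{1}{D_{ji}} \left[1 + \frac{(x_i - M_{ji})^2}{(d+1)\, D_{ji}^2}\right]^{-(d-1)/2} \tag{17}$$

A factor depending only on $d$ has been dropped since it cannot influence the choice
of class $j$. Note that no approximations have been made beyond assuming normal
uncorrelated distributions, statistical independence of the various observed properties,
and that no a-priori knowledge of the means and variances is available. In particular,
the division by $d$ in defining $M$ and $D$ will not affect the classification made but
only the form of expression (17); it does not amount to an estimate of the means and
variances. For large $d$ the bracket in expression (17) approaches the exponential
limit and expression (17) becomes

$$p_j \prod_{i=1}^{r} \frac{1}{D_{ji}} \exp\left[-\frac{(x_i - M_{ji})^2}{2 D_{ji}^2}\right]$$

which is just what would be obtained if the sample means and variances were inserted
in the property vector probability density. This shows that the usual method is
asymptotically optimum in minimizing classification errors, since the use of expressions
(8) and (17) has been shown to be optimum for any sample size under the assumptions
made here.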
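The large-sample statement can be illustrated numerically for a single property. The sketch below assumes the closed form that follows from carrying out the integrations of Eq. (14) with a flat a-priori density, namely a score proportional to (1/D) [1 + (x - M)^2 / ((d + 1) D^2)]^(-(d - 1)/2), and compares its logarithm with the logarithm of the plug-in Gaussian score (1/D) exp(-(x - M)^2 / (2 D^2)). The function names and the sample values are hypothetical, and factors that do not depend on the class are dropped.

```python
import math


def sample_stats(samples):
    """Sample mean M and variance D^2, both with division by d."""
    d = len(samples)
    mean = sum(samples) / d
    var = sum((s - mean) ** 2 for s in samples) / d
    return mean, var


def averaged_log_score(x, samples):
    """log of (1/D) * [1 + (x-M)^2 / ((d+1) D^2)]^(-(d-1)/2), one property."""
    d = len(samples)
    mean, var = sample_stats(samples)
    bracket = 1.0 + (x - mean) ** 2 / ((d + 1) * var)
    return -0.5 * math.log(var) - 0.5 * (d - 1) * math.log(bracket)


def plug_in_log_score(x, samples):
    """log of (1/D) * exp(-(x-M)^2 / (2 D^2)): sample values inserted
    directly in the Gaussian density (constant factor dropped)."""
    mean, var = sample_stats(samples)
    return -0.5 * math.log(var) - (x - mean) ** 2 / (2.0 * var)


small = [-1.0, 0.0, 1.0]          # d = 3
large = [-1.0, 0.0, 1.0] * 700    # d = 2100, same M and D
x = 2.0

gap_small = abs(averaged_log_score(x, small) - plug_in_log_score(x, small))
gap_large = abs(averaged_log_score(x, large) - plug_in_log_score(x, large))
print(gap_small, gap_large)
```

Both samples have the same mean and variance, so the difference between the two gaps comes only from the sample size: the averaged score has much heavier tails than the plug-in score when d is small, and the two nearly coincide when d is large.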

The integrations required to get an explicit formula for the correlated
Gaussian case are very difficult, but it is not yet known whether a closed form for the
score function is impossible. The principal difficulty is in integrating a complicated
version of Eq. (14) over all values of the parameters (means, variances, and
covariances) which result in a positive definite covariance matrix.

2.4. Model using "ramp" densities

A certain aspect of the Gaussian model, either as described above or in the more
usual estimator-recognizers, is quite unsatisfying intuitively. Suppose there are
two equally probable classes and one property, and that the second class has a
larger mean than the first. The set of $x$'s which will be identified as class 1
will be the set satisfying

$$\frac{1}{D_1} \left[1 + \frac{(x - M_1)^2}{(d+1)\, D_1^2}\right]^{-(d-1)/2} > \frac{1}{D_2} \left[1 + \frac{(x - M_2)^2}{(d+1)\, D_2^2}\right]^{-(d-1)/2}$$

If $d = 2$, this simplifies to

$$x < \frac{M_1 + M_2}{2} + \frac{3}{2} \, \frac{D_2^2 - D_1^2}{M_2 - M_1}$$

The first term places the boundary point halfway between the two centers of gravity
of the sample points; the second term is a correction for the case where the
variances of the two classes are different. The difficulty is that the correction is always
toward the class with the larger variance. To illustrate why this is disturbing,
consider

$$\xi_1^{(1)} = -1, \quad \xi_1^{(2)} = 1, \quad M_1 = 0, \quad D_1^2 = \frac{1 + 1}{2} = 1$$

$$\xi_2^{(1)} = 2, \quad \xi_2^{(2)} = 5, \quad M_2 = 3.5, \quad D_2^2 = \frac{2.25 + 2.25}{2} = 2.25$$

and the region identified as class 1 is

$$x < \frac{0 + 3.5}{2} + \frac{3}{2} \cdot \frac{2.25 - 1}{3.5 - 0}$$

$$x < 2.29$$

Thus, the recognizer will call one of the samples of the second class, $\xi_2^{(1)}$, a member
of the first class! This does not seem reasonable, for if the points are plotted
on the $x$ axis (class 1 at $-1$ and $+1$, class 2 at $+2$ and $+5$),
almost anyone who is asked would place the boundary between $+1$ and $+2$; e.g., $x = 1.9$
should, it seems, be called a member of class 2. The basic difficulty is in assuming
a Gaussian distribution. If the properties are not normally distributed the above
process is not known to be optimum, and two points is certainly much too small a sample
to give the slightest verification of whether the distribution is indeed Gaussian.

Some workers have essentially abandoned the statistical approach, not only for
this reason but for others as well. But it seems desirable for many reasons to attempt
to retain the statistical framework. Before developing this approach it should be
emphasized that the boundary found in the example above ($x = 2.29$) is quite reasonable
if it is certain that the distributions are Gaussian, since the maximum number of
points on the $x$ axis may be classified correctly if one of the sample points is put on
the wrong side of the boundary. In such a case the recognizer is bound to have a
fairly high error rate and the only question is to keep it as low as possible.

If a person is given samples of several classes of pattern vectors in two
dimensions and asked to draw boundary lines between the regions which are to be
classified into each of the classes, he would probably try to minimize the number of
misclassified sample points while at the same time keeping the
boundary curves reasonably simple. Of course, by using complicated curves and
separated regions all sample points could be correctly classified, but the probability
of correctly classifying unknown vectors would not thereby be increased. In carrying
out this process, there is a tendency to emphasize the importance of points near the
boundaries and to almost ignore points far from any boundary. This is in contrast
to "Gaussian" methods, which weight the importance of points according to how much
they contribute to the knowledge of the means (equal weights) and to knowledge of the
variances (weighted more the farther from the mean).

It is now desired to reconcile the intuitive approach illustrated above with
the use of Eq. (8) as a score function. This can be done by finding a probability
density $p(x \mid \alpha)$ which emphasizes the sample points near the boundary of the two classes.
Note that if a large sample is obtained the probability density may not turn out to
resemble the one used here, but it must be repeated that the aim is to improve the
pattern recognition process, not to estimate probability density functions. The ramp
density function which is to be given may not improve the recognition accuracy, but
it will enable a mathematical comparison between non-parametric intuitive approaches
and more formal Gaussian approaches.

First, define a continuous function $\psi(x)$ with the following properties:
$\psi(x)$ approaches 1 monotonically as $x$ approaches infinity. The function $\psi(x) - \frac{1}{2}$
is an odd function of $x$. This implies that $\psi(0) = \frac{1}{2}$ and that $\psi(x)$ approaches zero
as $x$ approaches minus infinity. The function $\psi(x)$ looks like a step function with a
finite rise time.

The function $\psi(x)$ is not integrable and hence cannot be a probability density.
However, a suitable density can be defined in terms of $\psi(x)$:

$$p(x \mid \mu, \beta) = b \, \psi\left(\beta (x - \mu)\right) e^{-\epsilon |x|} \tag{19a}$$

Here $\epsilon$ is a very small number chosen so that $\epsilon |x|$ is small for any $x$ which is
likely to arise, $b$ is a constant chosen to insure that the integral of $p$ is unity
($-\infty < x < \infty$), $\mu$ is a parameter locating the position of the step, and $\beta$ is a second
statistical parameter whose magnitude gives the (inverse) rise time and whose sign
gives the direction of rise (to right or left). This density can be substituted in
Eq. (8) provided the following assumptions are made: that $\beta$ can assume only the value
$-\beta_0$ for class $j = 1$ and the value $\beta_0$ for class $j = 2$, and that $\lambda(\mu)$ is the a-priori
density of $\mu$ for both classes. Now consider the limiting case with $\epsilon = 0$ and with
$\beta_0 \to \infty$, so that $\psi$ becomes a step function. For class 2 the integrand in the
denominator of Eq. (8) vanishes unless $\mu$ is less than $\xi_2^{(t)}$ for all $t$, and the
integrand in the numerator vanishes unless $\mu$ is less than $x$ and less than every
$\xi_2^{(t)}$. Therefore, apart from the common factor $p_1 = p_2$,

$$A_2(x) = \begin{cases} b & \text{if } x > \xi_2^{(t)} \text{ for some } t \\[4pt] b \, \dfrac{a(x)}{a\left(\min_t \xi_2^{(t)}\right)} & \text{if } x < \xi_2^{(t)} \text{ for all } t \end{cases} \tag{19c}$$

where $a(u) = \int_{-\infty}^{u} \lambda(\mu) \, d\mu$. If all the $\xi_2^{(t)}$ are
larger than every $\xi_1^{(t)}$, and if $\lambda(\mu)$ is almost constant in the region of interest, the
boundary will be placed between the groups.

If the sample points overlap, $\beta_0$ must be set equal to some large value in order to
avoid an indeterminate boundary.
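The limiting case of the ramp model can be sketched for one property and two classes. The code below rests on several assumptions that are not in the text: the step is taken as perfectly sharp, the a-priori density of the step position is taken as constant on a finite window from `lo` to `hi` (a stand-in for "almost constant in the region of interest"), the common constant factor is dropped, and the function names and sample values are illustrative only. Class 1 uses a falling step and class 2 a rising step.

```python
def score_class1(x, samples1, hi):
    """Class 1 (falling step): score is 1 up to the largest class-1 sample,
    then decays linearly to 0 at `hi` (flat a-priori density assumed)."""
    top = max(samples1)
    if x <= top:
        return 1.0
    return max(hi - x, 0.0) / (hi - top)


def score_class2(x, samples2, lo):
    """Class 2 (rising step): score is 1 from the smallest class-2 sample
    upward, and rises linearly from 0 at `lo` below it."""
    bottom = min(samples2)
    if x >= bottom:
        return 1.0
    return max(x - lo, 0.0) / (bottom - lo)


def classify(x, samples1, samples2, lo, hi):
    """Return 1 or 2, whichever class has the larger score (ties go to 1)."""
    s1 = score_class1(x, samples1, hi)
    s2 = score_class2(x, samples2, lo)
    return 1 if s1 >= s2 else 2


# Illustrative, non-overlapping samples and an assumed window for the prior.
CLASS1 = [-1.0, 1.0]
CLASS2 = [2.0, 5.0]
LO, HI = -10.0, 10.0

labels = [classify(x, CLASS1, CLASS2, LO, HI) for x in (-1.0, 1.1, 1.9, 5.0)]
print(labels)
```

With these numbers the two scores cross between the largest class-1 sample and the smallest class-2 sample, so every calibration sample falls on its own side of the boundary. The exact crossing point depends on the assumed window, which is why this is only a sketch of the behavior and not a recommended boundary rule.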

2.  5  Non- stationary  model  a)  The  generalization  (of  the  basic  model  of  Section  2.  2) 
which  will  be  developed  here  is  useful  for  cases  vhere  the  statistics  of  the  pattern 
generator  vary  as  functions  of  time.  Let  Tj,  ^2  •  •  •  be  a  series  of  points  in 
time  at  each  of  which  one  sample  property  vector  for  each  class  is  available.  The 
t' s  are  to  be  arranged  in  increasing  order,  and  t  is  to  be  a  time  later  than  any  of 
the  calibration  times  at  which  the  recognition  is  to  be  made.  Assume  that  the  Joint 
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density  of  the  properties  is  p  t)  at  time  t>  And  that  the  Joint  density  of  all 


calibration  data  is  now 


n  p  ij  T, 


All  of  the  steps  of  Section  Z.  2  can  be  carried  thru  analogously  leading  to  the  following 
instead  of  £q<  4: 

/  pfxjo,  r)  n  P  P;  Tj)  A,(o)da 

P(J..)  -Pj  .-JL2LJLo{i)  M 


^0, 


ft) 


\  ^  d  (t)  N 

where  ^  ^  P \ii  '^ty  ^i 

Since  G  does  not  depend  on  j,  the  classification  is  again  that  value  of  j  which 
maximizes  the  first  two  factors  on  the  right  side  of  Eq.  21. 

Some interesting questions are raised in examining the above process.
None of the calibration times can be greater than $\tau_d$, and if the usual case of
$\tau$ greater than all of the $\tau_t$ is considered, the pattern recognition process also
involves an element of prediction. The location of the boundaries between the regions
representing the classes in property space must be predicted at time $\tau$. The
boundaries depend on two probability densities, e.g.

$$p_j\, p(x \mid \alpha_j; \tau) = p_i\, p(x \mid \alpha_i; \tau)$$

determines the boundary between the i and j regions as a function of $\tau$. Note that the
parameters $\alpha$ are not assumed to be functions of time, but that this does not
essentially restrict the generality of the method. Thus, a possible form for the
probability density is

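$$p(x \mid \sigma, m, b; \tau) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{(x - m - b\tau)^2}{2\sigma^2}\right] \qquad (22)$$

which can be looked on either as a normal distribution with constant variance $\sigma^2$ and
changing mean $m + b\tau$, or as a density involving three parameters $\sigma$, m, b which do
not change with time.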
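
The prediction implied by Eq. 22 can be sketched numerically. The following is a
minimal illustration under stated assumptions, not the procedure of this report: it
takes one property, equal a-priori probabilities and a known common sigma, and it
replaces the integration over the parameters in Eq. 21 by an ordinary least-squares
fit of m and b for each class. The function names and the sample values are
hypothetical.

```python
import math

def fit_line(times, values):
    """Least-squares fit of values ~ m + b * t; returns (m, b)."""
    n = len(times)
    t_bar = sum(times) / n
    v_bar = sum(values) / n
    sxx = sum((t - t_bar) ** 2 for t in times)
    sxy = sum((t - t_bar) * (v - v_bar) for t, v in zip(times, values))
    b = sxy / sxx
    return v_bar - b * t_bar, b

def density(x, sigma, m, b, tau):
    """Eq. 22: normal density with mean m + b * tau and constant sigma."""
    z = (x - m - b * tau) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma)

def classify(x, tau, times, samples_by_class, sigma):
    """Index of the class with the largest extrapolated density at time tau."""
    scores = []
    for samples in samples_by_class:
        m, b = fit_line(times, samples)
        scores.append(density(x, sigma, m, b, tau))
    return scores.index(max(scores))

# Hypothetical calibration data: one sample per class at each calibration time.
times = [1.0, 2.0, 3.0, 4.0]
class_1 = [0.0, 1.0, 2.0, 3.0]   # mean drifts upward with time
class_2 = [6.0, 6.0, 6.0, 6.0]   # mean stays fixed

chosen = classify(4.5, 5.0, times, [class_1, class_2], sigma=1.0)
print("class", chosen + 1)
```

With these made-up samples the class 1 mean rises by one unit per time step, so the
boundary extrapolated to time 5 lies at 5 and x = 4.5 is assigned to class 1. A
boundary built from the pooled calibration means (1.5 and 6) would lie at 3.75 and
would assign the same x to class 2.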

b)  Another method for handling a time-varying process is sometimes more appropriate
than that given above. If the data are gathered in such a way that a fairly
large sample is obtained before the statistics change, but the recognition is to be
carried out at a later time when only a small calibration sample is available, the
previously obtained knowledge can be entered through the functions A. The
classification at the later time is for class j which minimizes


where $A(\alpha)$ is a suitable function peaked at 0 and where $\alpha_j^*$ is the estimate of $\alpha$
for class j made at the earlier time. If the early data are from a large sample and
if conditions are assumed not to have changed too much, then A can be made a sharply
peaked function and the old data will be more heavily weighted. This method does not
involve prediction, but it does accomplish somewhat the same thing. If p and A are
both Gaussian, a closed form can be obtained for the score, but it is difficult to
visualize for large d.

c)  Still another way of handling non-stationary statistics is the use of controls
as outlined in Section 2.7 below. This does not involve any prediction, either,
but rather it would be especially useful if the statistics change periodically, or
return to previously estimated values in a non-periodic manner.

To illustrate the value of prediction, consider the case of 2 properties


If the t index in $\xi$ represents time, and if x is measured at time t = 5, then
x should be in class 1. On the other hand, if t does not represent time, and if

does not correspond to      then x should be in class 2. Line 5 is the predicted
or extrapolated boundary for t = 5.

2.6  Modifications of the basic model

The basic model for a self-calibrating pattern recognizer which was discussed in
Section 2.2 above can be both specialized and generalized to suit different conditions.
The recognizer is essentially described by the score function $A_j(x, \xi)$ of Eq. 8: given
that x is the property vector of the pattern whose class is desired, and given the
calibration data $\xi$, the best choice for the unknown's class is that value of j which
maximizes $A_j$.

a)  A special case which is frequently used is to assume that all of the components
of the property vector are statistically independent if the parameters are known:

$$p(x \mid \alpha_j) = \prod_{i=1}^{r} p(x_i \mid \alpha_{ji})$$

where $\alpha_{ji}$ denotes the parameters of property i of class j. Making this substitution
in Eq. 8, it is easily shown that

$$A_j(x, \xi) = p_j \prod_{i=1}^{r} \lambda_{ji}(x_i, \xi_{ji}) \qquad (24)$$

where

$$\lambda_{ji}(x_i, \xi_{ji}) = \frac{\int p(x_i \mid \alpha) \prod_{t=1}^{d} p(\xi_{ji}^{(t)} \mid \alpha)\, A_{ji}(\alpha)\, d\alpha}{\int \prod_{t=1}^{d} p(\xi_{ji}^{(t)} \mid \alpha)\, A_{ji}(\alpha)\, d\alpha}$$

The uncorrelated Gaussian case illustrated in Section 2.3 above was in turn a special
case of this.
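
For one tractable instance of Eq. 24 the ratio of integrals defining the factor for
property i of class j has a closed form. The sketch below assumes, beyond what the text
states, that each property is Gaussian with a known variance `sigma2` and that the
a-priori density of its unknown mean is Gaussian with mean `prior_mean` and variance
`prior_var`. Under those assumptions the factor is the normal predictive density of
x_i given the calibration samples. All names and numbers are illustrative.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def lam(x_i, samples, sigma2, prior_mean, prior_var):
    """One factor of Eq. 24: predictive density of x_i given d calibration samples."""
    d = len(samples)
    post_prec = 1.0 / prior_var + d / sigma2          # precision of the mean after calibration
    post_mean = (prior_mean / prior_var + sum(samples) / sigma2) / post_prec
    return normal_pdf(x_i, post_mean, sigma2 + 1.0 / post_prec)

def score(x, calib, prior_prob, sigma2=1.0, prior_mean=0.0, prior_var=100.0):
    """Eq. 24 form: a-priori probability times the product of per-property factors."""
    s = prior_prob
    for x_i, samples in zip(x, calib):
        s *= lam(x_i, samples, sigma2, prior_mean, prior_var)
    return s

# Hypothetical calibration data: for each class, one list of d samples per property.
calib_1 = [[0.1, -0.2, 0.0], [1.0, 1.2, 0.9]]
calib_2 = [[2.0, 2.1, 1.8], [-1.0, -0.8, -1.1]]
x = [0.0, 1.0]
scores = [score(x, calib_1, 0.5), score(x, calib_2, 0.5)]
print("class", scores.index(max(scores)) + 1)
```

The predictive variance `sigma2 + 1 / post_prec` exceeds `sigma2` by an amount that
shrinks as d grows, which is how the small-sample uncertainty enters the score without
an explicit parameter estimate.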

b)  Several generalizations of Eq. 8 can be obtained, principally by relaxing some
of the conditions of statistical independence. The general procedure will always be to
follow Bayes' rule by writing down the joint probability density of all causes and
effects and integrating out those causes which are not desired. As is shown in the
Appendix, this will minimize the probability of misclassification. Unfortunately, the
score function does not usually factor into as simple a form as Eq. 8. The probability
density function which governs the property vector will be generalized to the form
$p(x \mid \alpha_j, \beta)$ where x is a property vector of r components, $\alpha_j$ is a statistical
parameter vector pertinent to the j'th class, and $\beta$ is a vector of statistical
parameters common to all classes. In order to clarify the meaning of $\beta$, these
parameters might be the variances and covariances which several workers have assumed
constant for all classes. The property vectors of each of the k classes are observed
d times, each such observation being a vector similar to x:

$$\xi_j^{(1)},\ \xi_j^{(2)},\ \ldots,\ \xi_j^{(d)}$$

The causes j, $\alpha_1, \alpha_2, \ldots, \alpha_k$ and $\beta$ are assumed to be statistically independent
random variables, but there may be dependence between the components of the $\alpha_j$ or
the $\beta$. If all of the causes are fixed, the effects $\xi_j^{(t)}$ and x are assumed to be
independent random variables. Under these restrictions, the joint probability density
of all causes and effects is:

$$P(c, e) = p_j\, B(\beta)\, p(x \mid \alpha_j, \beta) \prod_{i=1}^{k} A_i(\alpha_i) \prod_{t=1}^{d} p(\xi_i^{(t)} \mid \alpha_i, \beta) \qquad (25)$$

The joint density of the class of the unknown j and the effects is

$$P(j, e) = p_j \int B(\beta) \int p(x \mid \alpha_j, \beta) \prod_{i=1}^{k} A_i(\alpha_i) \prod_{t=1}^{d} p(\xi_i^{(t)} \mid \alpha_i, \beta)\, d\alpha\, d\beta \qquad (26)$$

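The integrations with respect to $\alpha$ can again be factored:

$$P(j, e) = p_j \int B(\beta)\, \frac{\int p(x \mid \alpha_j, \beta) \prod_{t=1}^{d} p(\xi_j^{(t)} \mid \alpha_j, \beta)\, A_j(\alpha_j)\, d\alpha_j}{\int \prod_{t=1}^{d} p(\xi_j^{(t)} \mid \alpha_j, \beta)\, A_j(\alpha_j)\, d\alpha_j}\; G(\xi \mid \beta)\, d\beta \qquad (27)$$

where

$$G(\xi \mid \beta) = \prod_{i=1}^{k} \int \prod_{t=1}^{d} p(\xi_i^{(t)} \mid \alpha, \beta)\, A_i(\alpha)\, d\alpha$$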

Note that G is the conditional density of all calibration data given the common
parameters $\beta$; and if the sample is large enough G should have a sharp peak at
$\beta = \beta^*$, the estimated values of $\beta$ and probably close to the true values. The
function G then behaves as a delta function and the effect is to insert $\beta^*$ in the
ratio above and in B. Then if the sample is large enough,

$$\prod_{t=1}^{d} p(\xi_j^{(t)} \mid \alpha_j, \beta^*)\, A_j(\alpha_j)$$

also behaves as a delta function with a result that

$$P(j, e) \approx p_j\, p(x \mid \alpha_j^*, \beta^*)\, B(\beta^*)$$

This shows that the recognizer described by Eq. 27 above gives the same result as the
ordinary method of estimating parameters in the limit of large sample size.

c)  The next generalization which seems desirable is to remove the restriction that
the calibration vectors $\xi_j^{(t)}$ be statistically independent. In the case of speech
recognition some of the properties may be related to frequencies, and a person with a
low pitched voice may have the frequencies of all the phonemes lower together. In the
case of radio station recognition the ionospheric conditions may be poor on a certain
day (value of t) and hence the amplitudes of all stations may be low. A more general
form for the joint density of causes and effects is then (replacing Eq. 1):

$$P(c, e) = p_j\, p(x \mid \alpha_j) \prod_{t=1}^{d} p(\xi_1^{(t)}, \xi_2^{(t)}, \ldots, \xi_k^{(t)} \mid \alpha_1, \ldots, \alpha_k) \prod_{i=1}^{k} A_i(\alpha_i) \qquad (28)$$


Here it is still assumed that the properties are independent for different values of t.
Of course, this could simply be integrated with respect to the $\alpha$ and called a score
function, but the resulting expression would be of little help in actual applications.
There is a need to again describe each property vector by its own probability density,
and this can be done by introducing an "undesired" cause or influence such as was done
in the Appendix. Let $\gamma$ be a random variable which describes the particular pattern
generator in use at the present time, $\gamma$ ranging over the space of all pattern
generators. The probability density of the property vector x using generator $\gamma$ of
class j is $p(x \mid \alpha_j, \gamma)$. The joint density of calibration vectors for a
particular t is

$$p(\xi_1^{(t)}, \ldots, \xi_k^{(t)} \mid \alpha_1, \ldots, \alpha_k) = \int \prod_{i=1}^{k} p(\xi_i^{(t)} \mid \alpha_i; \gamma)\, \Gamma_t(\gamma)\, d\gamma \qquad (29)$$

where $\Gamma_t(\gamma)$ is the probability density of the various generators. If Eq. 29 is
inserted in Eq. 28, and if the $\alpha$'s are independent, some manipulation yields:

$$P(j, e) = p_j \int \cdots \int \Gamma_1(\gamma_1) \cdots \Gamma_d(\gamma_d)\, \frac{\int p(x \mid \alpha) \prod_{t=1}^{d} p(\xi_j^{(t)} \mid \alpha; \gamma_t)\, A_j(\alpha)\, d\alpha}{\int \prod_{t=1}^{d} p(\xi_j^{(t)} \mid \alpha; \gamma_t)\, A_j(\alpha)\, d\alpha}\; G(\xi; \gamma_1 \ldots \gamma_d)\, d\gamma_1 \cdots d\gamma_d \qquad (30)$$

where

$$G(\xi; \gamma_1 \ldots \gamma_d) = \prod_{i=1}^{k} \int \prod_{t=1}^{d} p(\xi_i^{(t)} \mid \alpha; \gamma_t)\, A_i(\alpha)\, d\alpha$$

and where

$$p(x \mid \alpha) = \int p(x \mid \alpha; \gamma)\, \Gamma(\gamma)\, d\gamma$$

Note that if $\gamma_t$ is regarded as a known sampling time then

$$\Gamma_t(\gamma_t) = \delta(\gamma_t - \tau_t)$$

and Eq. 30 reduces to the time-varying prediction of Eq. 21. If the functions $\Gamma_t(\gamma)$
have a sharp peak at one of two values of $\gamma$ ($\gamma_A$, $\gamma_B$), depending on whether t is
in one or another of two disjoint subsets of 1, 2 ... d, then Eq. 30 amounts to
designing two pattern recognizers, one for $\gamma = \gamma_A$ and one for $\gamma = \gamma_B$. The present
method is thus useful in the usually-difficult case of multimodal probability
densities, and it permits the decision region of a particular class in property space
to be concave or even disconnected.

2.7  Controls and regression

In the last section a method was discussed in which an attempt is made to remove the
influence of an unwanted cause on the pattern recognition. Some of the properties,
components of x, might be chosen not to be influenced by the desired cause (the class
of the unknown) but rather by the undesired cause. The idea is to allow these "control"
properties to remove the effects of the undesired cause by means of their statistical
dependence (in the Gaussian case, correlation) on the properties which depend on both
the class and the undesired influence. Assume that the $\nu$'th class is described by the
following Gaussian density:


$$p(x \mid \alpha_\nu) = \frac{\sqrt{|\det D_{ij}|}}{(2\pi)^{n/2}} \exp\left[-\tfrac{1}{2}\,(x - m)^T D\,(x - m)\right]$$

where D is the inverse of the covariance matrix, m the means (both the $D_{ij}$ and the
$m_i$ constitute the $\alpha$'s), and where $|\det D_{ij}|$ is the determinant of the inverse
covariance matrix. Suppose that $x_1, \ldots, x_e$ depend on the class and the undesired
influence but that $x_{e+1}, x_{e+2}, \ldots, x_n$ depend only on the undesired influence. Let
the properties be designated by $y_i = x_i - m_i$, i = 1, 2 ... e, and let the controls be
designated by $z_i = x_{e+i} - m_{e+i}$, i = 1, 2 ... n - e. If the D matrix is partitioned as

$$D = \begin{pmatrix} S_\nu & R \\ R^T & W \end{pmatrix}$$

where $R^T$ is the transpose of R, it is seen that W does not depend on the class $\nu$. The
quadratic form in the exponent of the density can now be expanded:

$$(x - m)^T D\,(x - m) = y^T S_\nu\, y + 2\, y^T R\, z + z^T W z \qquad (33)$$

Now consider a change of variables:

$$y = y' + H z', \qquad z = z'$$

Since the Jacobian of this transformation is unity, the probability density of the
primed variables can be obtained merely by substituting in Eq. 32. Note that


and  therefore  the  new  inverse  covariance  matrix  D  '  is  given  by 


It is desired to make the y' and the z' statistically independent, i.e. to make the
off-diagonal portion of the above matrix 0. This requires that


as  if  they  were  properties. 
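
The decoupling of properties and controls can be checked in the smallest case of one
property y and one control z, where the blocks of the partitioned matrix reduce to
scalars. The sketch below assumes the quadratic form S y^2 + 2 R y z + W z^2 and the
change of variables y = y' + H z', z = z'. Expanding gives the cross term
2 (S H + R) y' z', so under these assumptions H = -R / S removes it and leaves the
coefficient W - R^2 / S on z'^2. The numerical values are arbitrary.

```python
def quad_form(y, z, S, R, W):
    """Exponent quadratic form for one property y and one control z."""
    return S * y * y + 2.0 * R * y * z + W * z * z

def decouple(S, R, W):
    """Return (H, W_new): the coefficient H and the new coefficient of z'^2."""
    H = -R / S
    W_new = W - R * R / S
    return H, W_new

# Arbitrary positive-definite example (S * W > R**2).
S, R, W = 2.0, 0.5, 1.5
H, W_new = decouple(S, R, W)

# For any primed pair, the original form equals a form with no cross term.
y_prime, z_prime = 0.3, -1.2
original = quad_form(y_prime + H * z_prime, z_prime, S, R, W)
decoupled = S * y_prime ** 2 + W_new * z_prime ** 2
print(abs(original - decoupled) < 1e-12)
```

In regression terms, y' = y - H z is the property with the part that is linearly
predictable from the control subtracted out.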
3.1  Probability of misclassification

All of the preceding analysis has been directed toward choosing the most probable
class of a pattern, and this was done in such a way as to minimize the probability of
incorrect classification. Now it is desirable to consider just how the actual error
probabilities can be calculated, how large they might be, and to compare different
models in this regard.

The first type of error probability will be that obtained if the true form of the
probability density is known and if the parameters in this density are known. This
probability of misclassification will be designated by $Q_0$. Later other error
probabilities will be considered, e.g. $Q_1$, the probability of misclassification where
the form of the probability density is known but where the values of the parameters
must be estimated from a finite calibration sample.

Let there be k classes of patterns, each governed by an r variable probability
density $p_i(x_1, \ldots, x_r)$, i = 1, 2, ... k. If the density functions are known, the
situation is quite simple. Let the a-priori probability of the i'th class be $p_i$; then
if the pattern $x_1, x_2, \ldots, x_r$ is observed the classification is that value of i
which maximizes

$$p_i\, p_i(x_1, x_2, \ldots, x_r)$$

Except for boundary regions of zero probability, the x space can be divided into k
disjoint regions $R_1, R_2, \ldots, R_k$ such that in region $R_i$ the above expression is
larger for class i than for any other class. The actual probability of
misclassification is then

$$Q_0 = 1 - \sum_{i=1}^{k} p_i \int_{R_i} p_i(x_1, \ldots, x_r)\, dx_1 \cdots dx_r$$

This can be calculated directly and there is no need for an experiment. The simplest
case is two normally distributed classes with r = 1.


$$p_i(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{(x - m_i)^2}{2\sigma^2}\right], \qquad i = 1, 2$$

Taking $m_1 < m_2$ for convenience, $R_1$ is defined by

$$-\infty < x < \frac{m_1 + m_2}{2}$$

and $R_2$ by

$$\frac{m_1 + m_2}{2} < x < +\infty$$

The probability of misclassification is then given by the "error function" (defined as
in Jahnke and Emde and Cramer respectively), assuming $p_1 = p_2 = \tfrac{1}{2}$:

$$Q_0 = \frac{1}{2} - \frac{1}{2}\,\mathrm{erf}\left(\frac{m_2 - m_1}{2\sqrt{2}\,\sigma}\right) = 1 - \Phi\left(\frac{m_2 - m_1}{2\sigma}\right) \qquad (38)$$

If $\sigma \to 0$ the error approaches zero. The worst case, $m_1 = m_2$, results in a
probability of error of 1/2. Some intermediate results are tabulated below:


 (m2 - m1)/2σ    0      .5     .7     1      1.5    2      ∞

 Q0              .5     .309   .242   .159   .067   .023   0

                            Table 1
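
The two forms of Eq. 38 can be compared numerically. The sketch below is an
illustration only; it assumes that the error function meant in the text is the
standard one available as `math.erf`, and that the second function is the standard
normal cumulative distribution. The function names are hypothetical.

```python
import math

def phi(u):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def q0_two_class(m1, m2, sigma):
    """Eq. 38: error probability for two equally likely normal classes, r = 1."""
    u = (m2 - m1) / (2.0 * sigma)
    via_erf = 0.5 - 0.5 * math.erf(u / math.sqrt(2.0))
    via_phi = 1.0 - phi(u)
    # The two expressions in Eq. 38 are algebraically identical.
    assert abs(via_erf - via_phi) < 1e-12
    return via_phi

for u in (0.0, 0.5, 0.7, 1.0, 1.5, 2.0):
    print(u, round(q0_two_class(0.0, 2.0 * u, 1.0), 3))
```

The loop evaluates the formula at the finite separations listed in Table 1.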

Note that the density functions can be fairly large at the boundary between $R_1$ and
$R_2$. For $Q_0$ = .159 the boundary is one "standard deviation" from each mean, and the
density function is about 60% of its maximum value. The two variable normal error
function, although not appearing in every statistics book as does the one variable
function, has been extensively tabulated. The simplest case, uncorrelated variables,
reduces to the one variable case with a suitable rotation of coordinates. Suppose
there are r properties, each with the same separation of means $m_2 - m_1$, each with the
same standard deviation $\sigma$, and all uncorrelated. Then the two classes will be
separated by $\sqrt{r}$ times the separation in the one dimensional case, and the
probability of error will be

$$Q_0 = 1 - \Phi\left(\frac{\sqrt{r}\,(m_2 - m_1)}{2\sigma}\right) \qquad (39)$$


Some representative results are tabulated below (in % this time):

               (m2 - m1)/2σ
 r       0      .5     .7     1      1.5    2      ∞

 1       50     30.9   24.2   15.9   6.7    2.3    0
 2       50     24.0   16.3   7.5    1.7    0.3    0
 4       50     15.9   8.1    2.3    0.1    0      0
 9       50     6.7    1.8    0.1    0      0      0
 25      50     0.6    0      0      0      0      0

                       Table 2
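
Eq. 39 can also be turned around to ask how many uncorrelated properties are needed
for a given error rate. The following sketch is an illustration, not part of the
original analysis; it uses `statistics.NormalDist` from the Python standard library,
and the helper names and the 1% target are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def q0(u, r):
    """Eq. 39 with u = (m2 - m1) / (2 sigma): error for r uncorrelated properties."""
    return 1.0 - STD_NORMAL.cdf(math.sqrt(r) * u)

def properties_needed(u, target_q):
    """Smallest r with q0(u, r) <= target_q; requires u > 0 and 0 < target_q < 0.5."""
    r = math.ceil((STD_NORMAL.inv_cdf(1.0 - target_q) / u) ** 2)
    return max(r, 1)

# Properties whose means are separated by one standard deviation (u = 0.5).
print(properties_needed(0.5, 0.01))
```

For example, r = 4 with (m2 - m1)/2σ = 0.5 gives about 15.9%, the same as a single
property with twice the separation.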
The conclusion to be drawn from this case is that a moderate number of properties
can be used to make an almost-perfect pattern recognizer even though each property
is individually very poor, provided the different properties are uncorrelated. Since
most pattern recognizers use several properties in which the means are separated by
say half a standard deviation or more, and since most results are not as good as
given above, the indication is that many pattern recognizers fail because the
properties used are correlated with each other in an undesirable way. Of course, not
all forms of correlation are deleterious; if the concentration ellipse is
perpendicular to the line connecting the centroids, the performance can be improved by
the correlation since a linear function of the original properties will have a very
large value for $(m_2 - m_1)/2\sigma$.

Consider the case where there are two properties and two normally distributed classes
having the following densities:

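$$p_1 = \frac{1}{2\pi\sigma^2\sqrt{1-\rho^2}} \exp\left[-\frac{x^2 - 2\rho x y + y^2}{2\sigma^2(1-\rho^2)}\right]$$

$$p_2 = \frac{1}{2\pi\sigma^2\sqrt{1-\rho^2}} \exp\left[-\frac{(x-m)^2 - 2\rho(x-m)(y-m) + (y-m)^2}{2\sigma^2(1-\rho^2)}\right]$$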

By integrating these with respect to x or y from $-\infty$ to $+\infty$ it will be seen that
each property, when considered alone, has a variance $\sigma^2$ and a difference of means
of m. If $\rho = 0$ the previous uncorrelated case results. In any case, it is seen by
symmetry that the boundary between $R_1$ and $R_2$ is the 45° line x + y = m. The
probability of error is then:

$$Q_0 = \frac{1}{2\pi\sigma^2\sqrt{1-\rho^2}} \int_{-\infty}^{\infty} \int_{m-x}^{\infty} \exp\left[-\frac{x^2 - 2\rho x y + y^2}{2\sigma^2(1-\rho^2)}\right] dy\, dx$$

This special case can be integrated in terms of the single variable error function:

$$Q_0 = 1 - \Phi\left(\frac{m}{\sigma\sqrt{2(1+\rho)}}\right) \qquad (40)$$


               m/2σ
 ρ       0      .5     .7     1      1.5    2      ∞

 1       50     30.9   24.2   15.9   6.7    2.3    0
 0       50     24.0   16.3   7.5    1.7    0.3    0
 -.5     50     15.9   8.1    2.3    0.1    0      0
 -1             0      0      0      0      0      0

                       Table 3


Note that $\rho = 1$ is the worst case, but here the performance is merely degraded to
that obtained for the single variable case. This fact, that adding useless or
correlated properties cannot degrade the performance, has been proved as a general
theorem in a previous report.
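
The dependence of Eq. 40 on the correlation can be explored with a few lines of code.
This is a sketch under the assumptions of the two-property example above (equal
variances, means at (0, 0) and (m, m), equal a-priori probabilities); the function
names are hypothetical, and the formula is written in terms of u = m/2σ so that it can
be compared with the column headings of Table 3.

```python
import math

def phi(u):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def q0_correlated(u, rho):
    """Eq. 40 with u = m / (2 sigma); requires -1 < rho <= 1."""
    # m / (sigma * sqrt(2 (1 + rho))) equals u * sqrt(2 / (1 + rho)).
    return 1.0 - phi(u * math.sqrt(2.0 / (1.0 + rho)))

for rho in (1.0, 0.0, -0.5):
    print(rho, round(100.0 * q0_correlated(0.5, rho), 1))
```

At ρ = 1 the argument reduces to m/2σ, the single-property value, and at ρ = -0.5 it is
doubled, which matches the r = 4 row of Table 2.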

The case where there are more than two classes is very difficult to treat generally
because there are so many ways to place the centroids in the r dimensional property
space, each one leading to the integration of a multivariate normal density function
over polytopes. One method which might be used to obtain results which can be compared
with those given above is to assume that the centroids are themselves normally
distributed, making the expected value of the separation of two classes a function of
the standard deviation of the distribution of centroids. Some related calculations
have been carried out by S. O. Rice in connection with a communication problem (14).
A simple case which can be analyzed is that of four classes with two properties, the
centroids being at (0, 0), (0, m), (m, 0) and (m, m). If the variances are all $\sigma^2$ and
the properties are not correlated, the probability of error is:

$$Q_0 = 1 - \frac{1}{2\pi\sigma^2} \int_{-\infty}^{m/2} \int_{-\infty}^{m/2} \exp\left[-\frac{x^2 + y^2}{2\sigma^2}\right] dx\, dy \qquad (41)$$

This integral can be found as a special case of the bivariate error function tabulated
in Pearson. (Actually, the above can be integrated in terms of the 1 variable error
function if $\rho = 0$.) The percent of misclassified patterns is found to be:


 m/2σ    0      .5     .7     1      1.5    2      ∞

 %       75     52.3   42.5   28.4   12.9   4.6    0

                       Table 4

As might be expected, the error probabilities are almost doubled over two classes in
one dimension, and more than doubled over two classes in two dimensions. Suppose it
is desired to recognize $k = 2^r$ classes with r properties. The probability of error is
then

$$Q_0 = 1 - \frac{1}{(2\pi\sigma^2)^{r/2}} \int_{-\infty}^{m/2} \cdots \int_{-\infty}^{m/2} \exp\left[-\frac{1}{2\sigma^2} \sum_{i=1}^{r} x_i^2\right] dx_1 \cdots dx_r = 1 - \left[\Phi\left(\frac{m}{2\sigma}\right)\right]^r$$

For any fixed m/2σ, $Q_0$ approaches unity as r approaches infinity. This is, of course,
a very large number of classes to recognize; it would be very unusual for the number
of classes to go up exponentially with the number of properties.
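
The growth of the error with r in this arrangement can be tabulated directly from the
closed form. The sketch below assumes the centroid layout described above (every
coordinate of every centroid is either 0 or m, properties uncorrelated with common
σ), so that a pattern is classified correctly only if each coordinate falls on the
right side of m/2. The function names are hypothetical.

```python
import math

def phi(u):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def q0_hypercube(u, r):
    """Error probability for 2**r classes, r properties, with u = m / (2 sigma)."""
    return 1.0 - phi(u) ** r

for r in (1, 2, 5, 10, 50):
    print(r, 2 ** r, round(100.0 * q0_hypercube(1.0, r), 1))
```

With r = 2 this is the four-class case of Eq. 41; for fixed u the printed percentage
climbs toward 100 as r increases.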

3.2  Communication theory

The theory of communication provides some useful ideas, relations and formulas
for certain types of pattern recognition. Pattern recognition is similar to
communication where the type of channel noise is unknown and only samples of each
message are available. Note particularly the fact that random coding is often as good
as the best known systematic codes, since the designer of a pattern recognizer cannot
usually design the pattern generator also. Let $\overline{m^2}$ be the expected value of the
square of one mean of a class, there being k classes in r dimensional space, and let
$\sigma$ be the standard deviation of one of the (independent) properties. The fundamental
channel capacity formula, $CT = TW \log_2 (1 + S/N)$, gives

$$\log_2 k = \frac{b\,r}{2} \log_2\left(1 + \frac{\overline{m^2}}{\sigma^2}\right), \qquad b < 1$$

if the signalling is to be at a rate no greater than the channel capacity. The
probability of error approaches zero as $r \to \infty$ for b fixed, but the rate of approach
is faster if b is smaller. The formulas for probability of error are more complicated,
but if some approximations are made to Eq. 6.90 of Fano's book there results for
small b:


1.  84  k 


-  -  log,  (i  + 

4  ^  O' 


log,(l  + 


In pattern recognition the fact that $Q \to 0$ as $r \to \infty$ does not have the
significance it does in communication since there is seldom available an inexhaustible
supply of independent properties. It would be useful to know the meanings of the
analogs of Fano's Eq. 5.164 and 5.199 in pattern recognition.

3.3  Added error due to finite calibration sample

The above analyses have assumed that the statistical parameters were known
quantities, but usually they are not known and must be estimated from calibration data.
The probability of incorrect classification might now be expected to be larger due to
the errors in parameter estimation. Suppose that the j'th class is governed by the
probability density

$$p(x_1, x_2, \ldots, x_r \mid \alpha_{j1}, \alpha_{j2}, \ldots)$$

The parameters $\alpha_{j\nu}$ are unknown and must be estimated by observing d property
vectors from each of the k classes. Let $\xi_{j\nu}^{(t)}$ be the $\nu$'th property from the t'th
sample from the j'th class. The usual maximum likelihood estimate of
$\alpha_{j1}, \alpha_{j2}, \ldots$ is the set of values which maximizes

$$\prod_{t=1}^{d} p(\xi_{j1}^{(t)}, \xi_{j2}^{(t)}, \ldots, \xi_{jr}^{(t)} \mid \alpha_{j1}, \alpha_{j2}, \ldots) \qquad (43)$$

The estimated values of the parameters, $\alpha_j^*(\xi)$, are then inserted into the density
and multiplied by the a-priori probabilities $p_j$:

$$p_j\, p(x \mid \alpha_j^*(\xi)) \qquad (44)$$

This may be regarded as a score function; the class with the highest score is the class
of x. Another type of score function is given by Eq. 8 above. It is often very
difficult to relate the accuracy of the parameter estimation and the accuracy of the
pattern recognizer, and indeed the arguments leading up to the optimum recognizer of
Eq. 8 do not explicitly use parameter estimation at all. In the following analyses it
is usually most convenient to take the score function as a starting point.

The probability of the pattern recognizer misclassifying a vector of class j given
the calibration data $\xi$ is

$$Q(\xi, j) = \int p(x \mid \alpha_j)\, dx \qquad (45)$$

The range of integration is over that region of space where the j'th score is not the
largest. The expected value of this Q over all classes and calibration vectors is the
desired probability of misclassification:

$$Q_1 = \sum_{j=1}^{k} p_j \int G(\xi) \int_{\bar{R}_j(\xi)} p(x \mid \alpha_j)\, dx\, d\xi \qquad (46)$$

where $\bar{R}_j$ is the complement of the j'th class region in property space, i.e. the
region described in Eq. 45. Note that the p in Eq. 46 is the true density and need not
be the same as the p in Eq. 44. An equivalent form to Eq. 45 which may be more
convenient is:


A. 

J 


A 

k 


(47) 


where $P_j$ is the joint probability density of the scores considered as random
variables on the condition that x is chosen from the j'th class. The value of $Q_0$, the
probability of error of an optimum pattern recognizer which knows the parameter values,
is never larger than $Q_1$, for:

$$\sum_{j=1}^{k} p_j \int_{\bar{R}_j} p(x \mid \alpha_j)\, dx \;\le\; \sum_{j=1}^{k} p_j \int_{\bar{R}_j(\xi)} p(x \mid \alpha_j)\, dx$$

if the $\bar{R}_j$ are the regions which minimize the expression on the left. Multiplying by
$G(\xi)$ and integrating with respect to $\xi$:

$$Q_0 \le Q_1$$

In all of the above equations the $\alpha_j$ are their true values, although these may not
be known to the pattern recognizer.

An example will now be calculated to show the importance of calibration errors in
a simple case. Let there be two normally distributed classes with equal variances
$\sigma^2$ and described by one property. The mean of the first class will be taken as 0,
the mean of the second as m. If the a-priori probabilities of the two classes are
equal, the probability of error can be calculated from one class alone:

$$Q_1 = \int_{-\infty}^{\infty} \int_{A_1}^{\infty} P_1(A_1, A_2)\, dA_2\, dA_1$$

A possible choice for the score functions is:

$$A_1(x, \xi_1^{(1)} \ldots \xi_2^{(d)}) = \frac{1}{2d}\left(\xi_1^{(1)} + \ldots + \xi_1^{(d)}\right) + \frac{1}{2d}\left(\xi_2^{(1)} + \ldots + \xi_2^{(d)}\right) - x$$

$$A_2(x, \xi_1^{(1)} \ldots \xi_2^{(d)}) = x - \frac{1}{2d}\left(\xi_1^{(1)} + \ldots + \xi_1^{(d)}\right) - \frac{1}{2d}\left(\xi_2^{(1)} + \ldots + \xi_2^{(d)}\right) \qquad (48)$$

Here the decision can be made by a discriminant function $\Delta$:

$$\Delta = 2x - \frac{1}{d}\left(\xi_2^{(1)} + \xi_2^{(2)} + \ldots + \xi_2^{(d)}\right) - \frac{1}{d}\left(\xi_1^{(1)} + \ldots + \xi_1^{(d)}\right)$$

$\Delta$ will also be normally distributed with a mean $-m$ and a variance $(4 + 2/d)\,\sigma^2$.
The probability of error is now simply the probability of $\Delta$ being positive:

$$Q_1 = 1 - \Phi\left(\frac{m}{2\sigma\sqrt{1 + 1/2d}}\right) \qquad (50)$$

The error approaches the value of $Q_0$ given by Eq. 38 as d approaches infinity. The
increase in error is not very significant as the following table shows.


 m/2σ    0      .5     .7     1      1.5    2      ∞

 d=1     50     34.2   28.4   20.7   11.0   5.1    0
 d=∞     50     30.9   24.2   15.9   6.7    2.3    0

                       Table 5
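
Eq. 50 can be checked against a direct simulation of the recognizer. The sketch below
is illustrative only: it draws d calibration samples per class, forms the discriminant
from the two sample means, and counts how often a class 1 pattern (mean 0) produces a
positive discriminant. The function names, the seed and the number of trials are
arbitrary choices.

```python
import math
import random

def q1_formula(u, d):
    """Eq. 50 with u = m / (2 sigma) and d calibration samples per class."""
    z = u / math.sqrt(1.0 + 1.0 / (2.0 * d))
    return 1.0 - 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def q1_simulated(m, sigma, d, trials, seed=0):
    """Fraction of class 1 patterns for which the discriminant is positive."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        mean_1 = sum(rng.gauss(0.0, sigma) for _ in range(d)) / d
        mean_2 = sum(rng.gauss(m, sigma) for _ in range(d)) / d
        x = rng.gauss(0.0, sigma)          # unknown pattern drawn from class 1
        if 2.0 * x - mean_2 - mean_1 > 0.0:
            errors += 1
    return errors / trials

print(round(q1_formula(1.0, 1), 3))
print(round(q1_simulated(2.0, 1.0, 1, 20000), 3))
```

The simulated fraction fluctuates from run to run by a fraction of a percent at this
number of trials, so only approximate agreement with the formula should be expected.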


An actual pattern recognizer might make more errors if the sample size is small. In
the first place, if the recognizer does not know that the variances are equal, it might
be tempted to apply a correction such as Eq. 18 above. Also, the scores used here
assume that the mean of class 2 is greater than that of class 1; the decision is always
for class 1 if x is below the average of the two means even if the second sample mean
happens to fall below the first. The scores of Eq. 48 would only be used if it were
known beforehand that the second mean is above the first, but the actual error
probabilities are not changed significantly in any case.

3.4  Game theory aspects

In attempting to design a pattern recognizing machine for property vector probability
density functions more general than Gaussian, some difficulties arise even in defining
just what the problem is from the mathematical viewpoint. Game theory provides a
satisfactory framework in which to consider such problems. One very practical question
which needs to be answered is: How many statistical parameters should be included in
the assumed form of the property vector density? If too few parameters are included
the actual density cannot be approximated closely enough; if too many parameters are
included their estimates made from a small calibration sample will be inaccurate or
even impossible. This problem will occur in some other guise even if the pattern
recognizer does not depend on explicit parameter estimation, as in Section 2.2 above.

A pattern recognizer $\beta(d)$ will be defined as a set of rules whereby the class j of
a property vector x is decided upon after observing d sample vectors $\xi_j$ of each of
the k possible classes. The set of rules comprising $\beta(d)$ may or may not be stated in
terms of score functions $A_j(x, \xi)$ or assumed probability density functions. A
sequence of pattern recognizers $\{\beta(1), \beta(2), \ldots\} = \beta$ occurs if the size of the
calibration sample d is increased. The choice of the pattern recognizer is the move of
the first player. The actual probability density function of the property vector of the
j'th class will be assumed to be p(x | j). This is independent of d, and all of the
vectors x and $\xi_j^{(t)}$ (t = 1, 2 ... d, j = 1, 2 ... k) are assumed to be independent
random variables. The choice of the property vector density is the move of the second
player. The second player might be considered "nature", but if "nature" is thought of
as indifferent and impersonal this does not represent a very satisfying picture of a
competitive game. A better choice is to regard the second player as a "devil's
advocate" who is trying to encourage the designer to make a good pattern recognizer by
showing how poorly suggested designs might work under the most unfavorable conditions.
The utility function or payoff matrix will be taken here as the probability of
misclassification given by Eq. 46 or Eq. 47. In general, the payoff or probability of
misclassification will be some function of the 2 "moves" and the sample size:

$$Q_1 = Q_1[\beta,\ p(x \mid j),\ d]$$

Games of this type are called two-person zero-sum games. The first player tries to
minimize $Q_1$, the second player tries to maximize $Q_1$. An "equilibrium pair" is
defined as a recognizer $\beta_\mu$ and a density $p_\nu$ such that

$$Q_1[\beta_\mu, p]$$

attains a maximum at $p = p_\nu$, p in some fixed class, and

$$Q_1[\beta, p_\nu]$$

attains a minimum at $\beta = \beta_\mu$, $\beta$ in a fixed class. There may or may not be an
equilibrium pair depending on the function $Q_1$ and on the classes of $\beta$'s and p's
which are allowed (15,16). If several equilibrium pairs exist, they must have the same
value for $Q_1$.

A pattern recognizer which is one member of an equilibrium pair would be an optimum
design for the class of probability densities considered. If no equilibrium pair
exists in the sense defined above, it has been shown by von Neumann that the expected
value of $Q_1$ can be minimized by adopting a "mixed strategy", i.e. by choosing among
the $\beta$'s according to some set of probabilities instead of in a determinate manner.
This achieves the same result as an equilibrium pair.

An example will now be given to illustrate the aspects of game theory which are
involved here. Consider two classes with one property. The mean of the first class will
be taken as 0 and the mean of the second as $m$. Only the case $d = \infty$ will be considered.
The pattern recognizer will be described by the threshold $\theta$: all $x < \theta$ will be called
class 1, all $x > \theta$ will be called class 2. The probability densities will be either
Gaussian, or flat:

$$p(x \mid m, \sigma_2) = \begin{cases}
\dfrac{1}{2\sqrt{3}\,\sigma_2}, & -\sqrt{3}\,\sigma_2 < x - m < \sqrt{3}\,\sigma_2 \\[6pt]
0, & \sqrt{3}\,\sigma_2 < |x - m|
\end{cases}$$

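The probability of misclassification, assuming equal a-priori probabilities, is

$$Q_\infty = \frac{1}{2}\,\frac{1}{\sqrt{2\pi}\,\sigma_1}
\left[\int_{\theta}^{\infty} e^{-x^2/2\sigma_1^2}\,dx
+ \int_{m-\theta}^{\infty} e^{-x^2/2\sigma_1^2}\,dx\right]$$

$$Q_\infty = 1 - \frac{1}{2}\,\Phi\!\left(\frac{\theta}{\sigma_1}\right)
- \frac{1}{2}\,\Phi\!\left(\frac{m-\theta}{\sigma_1}\right)$$

for the Gaussian density, where $\Phi$ is the cumulative normal distribution function.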
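The closed form for the Gaussian case is easy to evaluate numerically. The following Python
sketch computes the misclassification probability for a threshold recognizer with equal
a-priori probabilities. The function names and the sample values of the threshold and of the
standard deviation are illustrative assumptions and are not taken from the report.

```python
import math


def normal_cdf(z: float) -> float:
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def misclassification_probability(theta: float, m: float, sigma: float) -> float:
    """Q = 1 - Phi(theta/sigma)/2 - Phi((m - theta)/sigma)/2.

    Class 1 has mean 0, class 2 has mean m, both are Gaussian with standard
    deviation sigma, and x < theta is called class 1 (equal priors assumed).
    """
    return (1.0
            - 0.5 * normal_cdf(theta / sigma)
            - 0.5 * normal_cdf((m - theta) / sigma))


if __name__ == "__main__":
    m = 1.0  # illustrative separation of the class means
    for theta in (0.3, 0.5, 0.7):
        q = misclassification_probability(theta, m, sigma=m / 3.0)
        print(f"theta = {theta:.1f} m  ->  Q = {100.0 * q:.1f}%")
```

For a symmetric pair of Gaussian densities the error is smallest when the threshold lies
midway between the two means, which the loop above makes visible.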


Four  probability  densities  will  be  considered: 


(Tj^  =  .  33  m 
or  2  =  •  ni 


(T  =  .  1  m 


(Gaussian) 


                        $p_1$     $p_2$     $p_3$     $p_4$

$\mathcal{R}_1$          14.6       5.3       6.7       0
$\mathcal{R}_2$          15.9       2.3       6.7       0
$\mathcal{R}_3$          15.0       3.3       7.7       1.9

Table 6


There is one equilibrium pair, $(\mathcal{R}_1, p_1)$.

According to game theory, $\mathcal{R}_1$ is preferred to the others, provided that the only
probability densities considered are those shown.
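The search for an equilibrium pair in a finite payoff table can be sketched in a few lines of
Python. The sketch assumes that rows are recognizers (the minimizing player) and columns are
densities (the maximizing player), and uses the entries as read from Table 6. The function
name is hypothetical.

```python
from typing import List, Tuple


def equilibrium_pairs(payoff: List[List[float]]) -> List[Tuple[int, int]]:
    """Return (row, column) indices of pure-strategy equilibrium pairs.

    An entry qualifies if it is the largest value in its row (the density
    player cannot do better) and the smallest value in its column (the
    recognizer player cannot do better).
    """
    pairs = []
    for i, row in enumerate(payoff):
        for j, value in enumerate(row):
            column = [r[j] for r in payoff]
            if value == max(row) and value == min(column):
                pairs.append((i, j))
    return pairs


# Entries as read from Table 6: rows are recognizers 1..3, columns densities 1..4.
TABLE_6 = [
    [14.6, 5.3, 6.7, 0.0],
    [15.9, 2.3, 6.7, 0.0],
    [15.0, 3.3, 7.7, 1.9],
]

if __name__ == "__main__":
    for i, j in equilibrium_pairs(TABLE_6):
        print(f"equilibrium pair: recognizer {i + 1}, density {j + 1}, "
              f"payoff {TABLE_6[i][j]}")
```

An empty result means that no pure-strategy equilibrium exists, which is the situation in
which the mixed strategy mentioned earlier becomes relevant.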

In any practical situation far more probability densities must be considered than
in the above example. The first question is whether equilibrium pairs exist. $Q_d$ can
always be made at least as large as $1 - \min_i p_i$ for any given $\mathcal{R}$ by the proper choice of
$p(x \mid j)$. To show this, set

$$p(x \mid j) = \begin{cases}
1/I_h(\xi), & x \in R_h(\xi) \\
0, & \text{otherwise}
\end{cases}$$

where $I_h(\xi) = \int_{R_h(\xi)} dx$

and $p_h \le p_i$, $i = 1, 2 \ldots k$.

Then, according to Eq. 46, $Q_d = 1 - p_h$. A similar argument shows that $Q_d$ can always
be made zero by proper choice of $p(x \mid j)$ for a fixed $\mathcal{R}$: set

$$p(x \mid j) = \begin{cases}
1/I_j(\xi), & x \in R_j(\xi) \\
0, & \text{otherwise}
\end{cases}$$

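and the bracketed term in Eq. 46 will vanish. Now what is the range of $Q_d$ with $p(x \mid j)$
fixed and $\mathcal{R}$ variable? The minimum value of $Q_d$ is attained by defining the $R_j$ as follows:

$$x \in R_j^0(\xi) \quad \text{if } p_j\,p(x \mid j) \ge p_i\,p(x \mid i) \text{ for all } i$$

$$x \notin R_j^0(\xi) \quad \text{if } p_j\,p(x \mid j) < p_i\,p(x \mid i) \text{ for some } i$$

Since this definition is independent of $\xi$, Eq. 46 reduces to

$$Q_\infty = \sum_{i=1}^{k} p_i \int_{x \notin R_i^0} p(x \mid i)\,dx \qquad (52)$$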

The quantity $Q_\infty$ is, by definition, independent of $\mathcal{R}$ and $d$ and it may even vanish if the
$p$'s are disjoint. The largest value of $Q_d$ is obtained by putting $x$ in $R_j$ where $j$ minimizes
$p_j\,p(x \mid j)$. An equilibrium pair will only exist if the row maximum is the same as the
column minimum:

$$Q_\infty = 1 - \min_i p_i$$

where $Q_\infty$ is given by Eq. 52. This pattern recognizer would make at least 50% errors
for 2 classes, 67% errors for 3 classes, etc., and would thus be a poor performer at
the equilibrium point. Of course, it might perform quite well for probability densities
other than those giving the equilibrium point. Why design the recognizer for the worst
possible $p$'s (overlapping densities) when these give unacceptable performance? It
may be better to restrict the $p$'s to those giving $Q_\infty$ less than a certain fixed upper
bound. It may also be desirable to restrict the $\mathcal{R}$'s so that the physical realization will
be reasonably economical. Note that the class of $\mathcal{R}$'s will usually include recognizers
which "know" the correct value of the statistical parameters before estimation, but
there are no grounds to eliminate these, and they cause no difficulty when the game
theory approach is used since they do poorly against other $p$'s. These $\mathcal{R}$'s which know
the correct parameter values do cause difficulty if some formal method of optimization
such as the calculus of variations is used to find the best estimation for fixed $p$.

4.1 Pattern detection

The sections above treat the problem of deciding which of $k$ classes a pattern
belongs to. The present sections are concerned with the problem of pattern detection.
This might be considered as pattern recognition with $k = 2$, the first class being
"pattern present", the second class being "no pattern - only normal background
present". Several other differences between classification and detection might be noted.
Often a detector is left running continuously in time and hence it observes overlapping
samples. The criterion of performance for a detector is often of the Neyman-Pearson
type, i.e. the probability of missing the desired pattern when it occurs is to be minimized
while keeping constant the "false alarm probability", the probability of saying a pattern
is there when it is not. This is in contrast to the usual symmetrical classification
criterion, that of minimizing the average probability of misclassification. If, in detection,
there are one or more types of disturbance from the normal background, the problem may
be stated as that of classifying into $k-1$ disturbance classes (only one of which is the
desired disturbance), and 1 background class. This comes back to another type of
pattern recognition in which no decision is made unless the maximum score function
exceeds a fixed threshold, e.g. 26 letters and "ambiguous letter". As an example of
the detection process, consider a seismic waveform as the pattern with the following
classes: 1) atomic explosion, 2) chemical explosion, 3) natural earthquake, 4) natural
background, i.e. all else. The detector is desired to give a warning when class 1, 2
or 3 has occurred, and then it is expected to decide whether the disturbance is of class 1
or not. Another difference between detection and classification is that in the former the
event to be detected might be out of the control of the experimenter and too rare to gather
good statistics on, while in the latter controlled calibration experiments can usually be made.
In the worst case it may be desired to detect an event which has not yet occurred, i.e. to
detect any unusual deviation from the norm.
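The Neyman-Pearson criterion can be illustrated with a minimal Python sketch. It assumes,
purely for illustration, that the detector produces a single scalar score which is Gaussian
with zero mean on background alone and with a shifted mean when the pattern is present; the
report does not specify such a score, and the function names are hypothetical. Under this
assumption the criterion reduces to fixing a threshold from the allowed false alarm
probability and then reading off the resulting miss probability.

```python
from statistics import NormalDist


def neyman_pearson_threshold(false_alarm: float, noise_sigma: float = 1.0) -> float:
    """Threshold giving the requested false alarm probability.

    Assumes the background-only score is Gaussian with zero mean and standard
    deviation noise_sigma, and that an alarm is raised above the threshold.
    """
    if not 0.0 < false_alarm < 1.0:
        raise ValueError("false_alarm must lie strictly between 0 and 1")
    return NormalDist(0.0, noise_sigma).inv_cdf(1.0 - false_alarm)


def miss_probability(threshold: float, signal_mean: float,
                     noise_sigma: float = 1.0) -> float:
    """Probability that the score stays below the threshold with the pattern present."""
    return NormalDist(signal_mean, noise_sigma).cdf(threshold)


if __name__ == "__main__":
    for pfa in (0.1, 0.01, 0.001):
        t = neyman_pearson_threshold(pfa)
        print(f"false alarm {pfa}: threshold {t:.3f}, "
              f"miss probability {miss_probability(t, 3.0):.4f}")
```

Tightening the false alarm probability raises the threshold and therefore the miss
probability, which is the trade-off the criterion holds fixed on one side.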

4.2 Matched filters

The methods used by workers in the field of pattern recognition seem entirely
different from those used by communication engineers, yet the problems are very similar
examples of statistical detection. In the former field it is common to use computer
programs with multivariate Gaussian matrix methods, while in the latter field one thinks of
matched RLC filters with additive noise. It is the purpose of this section to show the
connections between these two approaches, which are well known to some people, and then
to apply some results in multichannel filter design, mostly unpublished, to pattern detection.
The resulting system would use simple analog devices working in real time, and could
be operated continuously for pattern detection at unknown times.

The problem of pattern recognition can be described very briefly as follows:
A pattern generator G is given the class $j$ and this results in the pattern $\psi$. The
recognizer R looks at the pattern $\psi$ and decides on a class $h$.

The problem of detection is a special case where $j$ and $h$ take on only two values, signal
present or signal not present. The difficulties arise from several sources. First, while
$j$ is usually a selection of one of a small set of integers, $\psi$ can be quite a complicated
pattern. Second, if the same choice $j$ is repeated several times, the pattern $\psi$ may be
different each time. Third, while it might be said formally that there exists a conditional
probability of $\psi$ given $j$, $p(\psi \mid j)$, the functional form of this distribution is almost
never known beforehand. Fourth, even if the transformation R ($\psi \to h$) could be calculated
on paper in some optimum manner, its physical realization might be far too expensive.

A formal solution to the problem of finding the functional relation $h = Z(\psi)$ of the
pattern recognizer R above can be found by using Bayes' rule as shown in Appendix 1.
Thus, $Z(\psi)$ takes on the value $h_0$, where $h_0$ is the value of $h$ which maximizes

$$g(h \mid \psi) = \frac{p_h\,p(\psi \mid h)}{g_0} \qquad (54)$$

where $g_0 = \sum_j p_j\,p(\psi \mid j)$.

If the above expression has several equal values of $h$ which attain the maximum, an
arbitrary rule can be followed in picking one of them. As shown in Appendix 1, this
definition of $Z(\psi)$ minimizes the probability of misclassification. The actual calculation is
much more complicated than it might appear from this description because $\psi$ may be a
vector of many components.
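The decision rule of Eq. 54 can be sketched in Python for the simplest case of a scalar
pattern. The class names, the Gaussian likelihoods, and the function name below are
assumptions chosen for illustration only; the rule itself is the maximization of
$p_h\,p(\psi \mid h)$ described above.

```python
from statistics import NormalDist
from typing import Callable, Dict, Tuple


def bayes_decision(
    psi: float,
    priors: Dict[str, float],
    likelihoods: Dict[str, Callable[[float], float]],
) -> Tuple[str, Dict[str, float]]:
    """Return the class h maximizing g(h | psi) of Eq. 54, with all posteriors."""
    joint = {h: priors[h] * likelihoods[h](psi) for h in priors}
    g0 = sum(joint.values())  # normalizing term g_0
    if g0 == 0.0:
        raise ValueError("pattern has zero probability under every class")
    posteriors = {h: value / g0 for h, value in joint.items()}
    # Ties are broken arbitrarily: max() keeps the first maximizing class.
    best = max(posteriors, key=posteriors.get)
    return best, posteriors


if __name__ == "__main__":
    # Hypothetical two-class problem with unit-variance Gaussian likelihoods.
    priors = {"station A": 0.5, "station B": 0.5}
    likelihoods = {
        "station A": NormalDist(0.0, 1.0).pdf,
        "station B": NormalDist(2.0, 1.0).pdf,
    }
    print(bayes_decision(0.4, priors, likelihoods))
```

Since $g_0$ does not depend on $h$, the division is needed only when the posterior
probabilities themselves are to be displayed alongside the decision.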

The problem of designing box R in an optimum manner can be simplified by
generalizing a procedure of Zadeh and Ragazzini. Suppose that a non-singular
transformation U is added to the block diagram shown above:

It is obvious that if R' is optimum in deciding the cause $j$ of the pattern $x$, then UR'
is optimum in deciding the cause $j$ from the pattern $\psi$, i.e. UR' is an optimum design
for R. The box U can be chosen so as to make the design of R' as easy as possible, as
long as it allows $\psi$ to be calculated from $x$. The original suggestion was to make U
uncorrelate the components of $\psi$, or if the components of $\psi$ occur sequentially, to make
U flatten the spectrum of one of the pattern classes (noise).

A final modification of the system will be made before considering the actual
processes in more detail. The symbol $\psi$ can represent a very complicated vector in
many pattern recognition problems. The design of R' is certainly easier if $x$ is a simple
vector, but how can U reduce the dimensionality of a signal and still be non-singular?
The answer lies in the fact that much of the complexity of $\psi$ is due to random
perturbations from an ideal pattern. Suppose that the system is to recognize hand-printed
letters; most of the difficulty comes from the fact that each time a particular letter is
printed it is slightly different from other samples, even from the same person. In the
case of phoneme recognition, each time a "long a" is voiced the waveform does not
trace exactly the same values, even with the same speaker using the same word. The
pattern $\psi$ may be said to be due to a prototype $v$ and random disturbances $z$.

The requirements on U may now be eased by asking only that $v$ can be calculated from
$x$, not that $v$ and $z$ can be. Since there are fewer unperturbed patterns (prototypes)
than perturbed patterns, $v$ can be of much lower dimensionality than $\psi$, and so can $x$.
It may be convenient to leave some of the pattern's randomness in what is called $v$, and
$x$ may contain more information than just that necessary to calculate $v$. Note the
advantages which ensue from the block $z$ being representable by a group of transformations,
as described in Reference 13.

A basic procedure for designing a pattern recognizer or detector will now be as
follows: First, it must be found what class of optimum decision box R' can be built.
This class may be limited by the complexity of the apparatus required, by the availability
of useful theories of design, and by the information needed on the statistics of the $x$.
Next, the box U must be designed bearing in mind the need to supply a suitable signal to R'
and the need for preserving the essential information in $v$.

In order to illustrate the design procedure for the decision box R', first consider
the case where all $r$ components of the vector $x$ are statistically independent
(for a fixed $j$), and where each $x_i$ is normally distributed with the mean depending on $j$
and with equal variances. If $\Lambda_j$ is defined as the natural log of $p_j\,p(x \mid j)$, then

$$\Lambda_j(x_1, x_2 \ldots x_r; j) = \ln p_j + \ln p(x_1, x_2 \ldots x_r \mid j) \qquad (55)$$

$$\Lambda_j(x_1, \ldots; j) = \ln p_j - r \ln(\sigma\sqrt{2\pi})
- \frac{1}{2\sigma^2} \sum_{i=1}^{r} [x_i - m_{ji}]^2 \qquad (56)$$

where $\sigma^2$ is the variance of the $x$'s and
$m_{ji}$ is the mean of $x_i$ from class $j$.

If all of the classes are equally probable then $p_j$ is a constant and the decision function
$Z(x)$, which is defined as the value of $j$ which maximizes $p_j\,p(x \mid j)$, is given by the value
of $j$ which minimizes

$$\sum_{i=1}^{r} [x_i - m_{ji}]^2 \qquad (57)$$

This is recognized as the squared Euclidean distance from the vector $x$ to the vector $m_j$,
so that the decision box simply picks the closest class centrum in the feature space $\{x\}$.
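The minimum-distance rule of Eq. 57 can be sketched as follows in Python. The class labels
and centrum coordinates are made-up illustrative values, and the function names are
hypothetical; the rule assumes equal a-priori probabilities and equal variances, as stated
above.

```python
from typing import Dict, Sequence


def squared_distance(x: Sequence[float], m: Sequence[float]) -> float:
    """Sum over i of (x_i - m_i)^2, the quantity of Eq. 57."""
    return sum((xi - mi) ** 2 for xi, mi in zip(x, m))


def nearest_centrum(x: Sequence[float], centra: Dict[str, Sequence[float]]) -> str:
    """Return the class whose centrum is closest to the property vector x."""
    return min(centra, key=lambda j: squared_distance(x, centra[j]))


if __name__ == "__main__":
    # Hypothetical class means in a two-property feature space.
    centra = {"A": [0.0, 0.0], "B": [3.0, 0.0], "C": [0.0, 4.0]}
    for x in ([1.0, 0.5], [2.5, 1.0], [0.2, 3.0]):
        print(x, "->", nearest_centrum(x, centra))
```

Because only the ordering of the distances matters, the square root of the Euclidean
distance never needs to be taken.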

In the case of detection, there are just two classes: noise where $j = 0$ and signal + noise
where $j = 1$. Since the noise usually has a zero mean, the question is simply whether
the first or second of the quantities

$$g_1 = \sum_{i=1}^{r} x_i^2 - 2 \sum_{i=1}^{r} x_i\,m_{1i} + \sum_{i=1}^{r} m_{1i}^2$$

$$g_0 = \sum_{i=1}^{r} x_i^2 \qquad (58)$$

is the smaller. There is a signal present if $g_1 < g_0$, that is if

$$\sum_{i=1}^{r} x_i\,m_{1i} > \frac{1}{2} \sum_{i=1}^{r} m_{1i}^2 \qquad (59)$$


Ordinarily, in pattern recognition a calculation must be made for each of the classes
involved, or a device must be made to calculate the probability of each class from the
$x$'s. This can be generalized so that $k-1$ devices (or computer subroutines) can make
a decision involving $k$ classes, but the loss of symmetry would probably outweigh any
simplification gained in most cases. Returning to the last equation, the recognizer is
simply linearly correlating the received set of $x$'s with the signal $m$ and requiring that
this correlation be at least half the correlation of $m$ with itself. If the $x$'s occur
sequentially in time, then the last equation describes the action of a linear electrical
filter whose impulse response is the signal $m$ reversed in time, the familiar North filter
of radar usage.
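The equivalence between the correlation test of Eq. 59 and a linear filter can be checked
with a short Python sketch. It assumes discrete samples and a finite-length signal; the
signal values and function names are illustrative and not taken from the report. The filter
uses the time-reversed signal as its impulse response, and its output at the last sample
equals the correlation sum.

```python
from typing import List, Sequence


def correlate(x: Sequence[float], m: Sequence[float]) -> float:
    """Sum over i of x_i * m_i."""
    return sum(xi * mi for xi, mi in zip(x, m))


def signal_present(x: Sequence[float], m: Sequence[float]) -> bool:
    """Eq. 59: correlation with m must exceed half the correlation of m with itself."""
    return correlate(x, m) > 0.5 * correlate(m, m)


def fir_filter(samples: Sequence[float], impulse_response: Sequence[float]) -> List[float]:
    """Causal convolution y[n] = sum over k of h[k] * samples[n - k]."""
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, h in enumerate(impulse_response):
            if n - k >= 0:
                acc += h * samples[n - k]
        out.append(acc)
    return out


if __name__ == "__main__":
    m = [1.0, 2.0, -1.0, 0.5]            # hypothetical signal shape
    x = [0.9, 2.2, -0.8, 0.4]            # hypothetical received samples
    matched = list(reversed(m))          # impulse response = time-reversed signal
    print("correlation:", correlate(x, m))
    print("filter output at last sample:", fir_filter(x, matched)[-1])
    print("signal present:", signal_present(x, m))
```

In a continuously running detector the filter output would be compared with the threshold at
every sample, which is what makes the filter form convenient for patterns arriving at
unknown times.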

A generalization of the above process can lead to the filter designed by Dwork (3)
to find a pulse in non-white noise. Suppose now that each class has a correlated normal
distribution:

$$p(x_1, x_2 \ldots x_r \mid j) = \frac{|\det D_j|^{1/2}}{(2\pi)^{r/2}}
\exp\left[-\frac{1}{2} \sum_{i,\ell=1}^{r} D_{ji\ell}\,(x_i - m_{ji})(x_\ell - m_{j\ell})\right] \qquad (60)$$

Ordinarily, both the means $m_{ji}$ and the $D_{ji\ell}$ (which are elements of the reciprocal of
the covariance matrix) will be functions of $j$. However, if signal and noise are added
linearly, the $D_{i\ell}$ will be independent of $j$. If the noise has zero means, the question of
signal + noise or noise now depends on the smaller of:

$$g_1 = \sum_{i,\ell=1}^{r} D_{i\ell}\,(x_i - m_{1i})(x_\ell - m_{1\ell})$$

$$g_0 = \sum_{i,\ell=1}^{r} D_{i\ell}\,x_i\,x_\ell \qquad (61)$$

Again, a signal is present if $g_1 < g_0$, that is if

$$\sum_{i=1}^{r} \sum_{\ell=1}^{r} D_{i\ell}\,x_\ell\,m_{1i}
> \frac{1}{2} \sum_{i=1}^{r} \sum_{\ell=1}^{r} D_{i\ell}\,m_{1i}\,m_{1\ell}$$

If the $x_i$ occur sequentially, this describes (by a double convolution) the passage of the
$x_i$ thru two linear filters in tandem:

$$\sum_{\ell=1}^{r} D_{i\ell}\,x_\ell = x'_i \qquad (63)$$

$$\sum_{i=1}^{r} m_{1i}\,x'_i = x''$$

The first is a "spectrum-flattening filter" whose impulse response is given by the $D_{i\ell}$.
Statistically speaking, this first filter decorrelates the variables $x_i$. The second filter is
again a "matched filter" for the signal $m_{1i}$. Incidentally, the first filter can be looked on
as an example of box U as described in the last section, and is in fact the same as the U
employed by Zadeh and Ragazzini.
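The two filters in tandem can be sketched in Python as a matrix-vector product followed by a
correlation. The sketch takes the matrix of the $D_{i\ell}$ (the reciprocal of the noise
covariance matrix) as given rather than computing it; the numerical values and the function
names are illustrative assumptions. It also checks that the tandem form agrees with the
direct comparison of the two quadratic forms $g_1$ and $g_0$.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def mat_vec(d: Matrix, x: Sequence[float]) -> List[float]:
    """First filter: x'_i = sum over l of D_il * x_l."""
    return [sum(row[l] * x[l] for l in range(len(x))) for row in d]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(ai * bi for ai, bi in zip(a, b))


def quadratic_form(d: Matrix, u: Sequence[float]) -> float:
    """sum over i, l of D_il * u_i * u_l."""
    return dot(u, mat_vec(d, u))


def signal_present_tandem(x: Sequence[float], m: Sequence[float], d: Matrix) -> bool:
    """Spectrum-flattening filter, then matched filter, then threshold."""
    x_prime = mat_vec(d, x)          # first filter
    x_double_prime = dot(m, x_prime) # second (matched) filter
    return x_double_prime > 0.5 * quadratic_form(d, m)


def signal_present_direct(x: Sequence[float], m: Sequence[float], d: Matrix) -> bool:
    """Direct test g_1 < g_0 using the two quadratic forms."""
    diff = [xi - mi for xi, mi in zip(x, m)]
    return quadratic_form(d, diff) < quadratic_form(d, x)


if __name__ == "__main__":
    # Inverse of the hypothetical noise covariance [[1, 0.5], [0.5, 1]].
    d = [[4.0 / 3.0, -2.0 / 3.0], [-2.0 / 3.0, 4.0 / 3.0]]
    m = [1.0, 1.0]
    for x in ([0.9, 1.2], [0.1, -0.2]):
        print(x, signal_present_tandem(x, m, d), signal_present_direct(x, m, d))
```

The agreement of the two tests relies on the matrix being symmetric, which holds for the
reciprocal of any covariance matrix.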

The general correlated normal distribution for the $x$ leads to a quadratic (rather
than linear) decision box. The most probable class $j = Z'(x)$ is that value of $j$ which
minimizes

$$\sum_{i,\ell=1}^{r} D_{ji\ell}\,[x_i - m_{ji}][x_\ell - m_{j\ell}] - \ln |\det D_j|$$

for equally probable classes. This might be regarded as a generalized distance from the
unknown to be classified, $x$,

to the centrum of the $j$'th class, $m_j$. If the covariances $c_{ji\ell}$ are small compared to the
variances $\sigma_{ji}^2$,

$$|c_{ji\ell}| \ll \sigma_{ji}\,\sigma_{j\ell} \quad (i \ne \ell), \qquad c_{jii} = \sigma_{ji}^2$$

then an approximate formula for the inverse is

$$D_{jii} \approx \frac{1}{\sigma_{ji}^2}, \qquad
D_{ji\ell} \approx -\frac{c_{ji\ell}}{\sigma_{ji}^2\,\sigma_{j\ell}^2} \quad (i \ne \ell)$$

The determinant will be approximated to the same order, by considering only the
product of terms on the main diagonal. The quantity to be minimized is then approximately

$$\sum_{i=1}^{r} \left[\frac{x_i - m_{ji}}{\sigma_{ji}}\right]^2
- \sum_{i \ne \ell} c_{ji\ell}\,\frac{x_i - m_{ji}}{\sigma_{ji}^2}\,\frac{x_\ell - m_{j\ell}}{\sigma_{j\ell}^2}
+ \sum_{i=1}^{r} \ln \sigma_{ji}^2$$

This form permits first order perturbations of the uncorrelated filter, and it avoids the
necessity for calculating the inverse of the covariance matrix.

A matrix generalization of the North-Dwork filter has been given by D. C.
Youla. Suppose that there are $p$ channels with time functions $x_1(t), x_2(t) \ldots x_p(t)$.
Let $w_\phi(t)$ be the desired signals (all due to a single disturbance at time $t_0$) with spectra
$W_\phi(f)$, and let the Fourier transform of the crosscorrelation function between the
background noises in channels $\phi$ and $\nu$ be $\Omega_{\phi\nu}(f)$. The filter is to have $p$ inputs
and one output, each having a voltage function of time associated with it. The transfer
functions $R_\nu(f)$ between the inputs $\nu = 1, 2 \ldots p$ and the output should then satisfy the
following simultaneous algebraic equations in order to maximize the ratio of peak signal to
RMS noise at the filter output:

$$W_\phi^*(f) = \sum_{\nu=1}^{p} \Omega_{\phi\nu}(f)\,R_\nu(f), \qquad \phi = 1, 2 \ldots p \qquad (67)$$

If $p = 1$ this reduces to the Dwork filter (3):

$$R_1(f) = \frac{W^*(f)}{\Omega(f)}$$

where $\Omega(f)$ is the power spectrum of the noise. These equations do not generally yield
realizable network functions, but allowing a delay in the output permits sufficiently good
approximations to be made without materially affecting the results. To obtain the best
filter if only a finite delay is allowed, simultaneous Wiener-Hopf integral equations must
be solved. A diagram of a detection system using multiple matched filters is shown
in Fig. 3. The voltage in channel 1 has a + pulse at the time of the disturbance $t_0$ (top of
figure), channel 2 has a delayed - pulse, channel 3 has a + pulse followed by a - pulse.
The filter R combines these in such a way that contributions from all channels add up
at time $t_0 + t_d$ and exceed the threshold $\theta$.
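At any one frequency the simultaneous equations for the transfer functions are an ordinary
linear system in complex numbers. The Python sketch below solves the two-channel case by
Cramer's rule and compares it with the single-channel Dwork form. The numerical values of
the signal spectra and of the noise cross-spectral matrix are made up for illustration, and
the function names are hypothetical.

```python
from typing import List, Tuple


def dwork_filter(w: complex, noise_power: float) -> complex:
    """Single-channel case: R = conj(W) / noise power spectrum, at one frequency."""
    return w.conjugate() / noise_power


def two_channel_filter(
    w1: complex, w2: complex, omega: List[List[complex]]
) -> Tuple[complex, complex]:
    """Solve conj(W_phi) = sum over nu of omega[phi][nu] * R_nu for p = 2."""
    (a, b), (c, d) = omega
    det = a * d - b * c
    if det == 0:
        raise ValueError("noise cross-spectral matrix is singular at this frequency")
    y1, y2 = w1.conjugate(), w2.conjugate()
    r1 = (y1 * d - b * y2) / det  # Cramer's rule, first unknown
    r2 = (a * y2 - c * y1) / det  # Cramer's rule, second unknown
    return r1, r2


if __name__ == "__main__":
    w1, w2 = 1.0 + 2.0j, 0.5 - 1.0j
    # Hypothetical Hermitian cross-spectral matrix of the channel noises.
    omega = [[2.0, 0.5 + 0.5j], [0.5 - 0.5j, 3.0]]
    print(two_channel_filter(w1, w2, omega))
```

When the off-diagonal terms vanish the channels decouple and each transfer function is the
single-channel result for its own channel.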

The major problem in the design of such a system is to decide on the boxes $U_1$,
$U_2 \ldots U_p$. These must provide signals and noises for which a linear filter is
reasonably efficient at filtering the signal from noise. Whereas a straightforward design
procedure has just been given for R', no such procedures are known for the U's. This
is the familiar problem of property selection in pattern recognition. Several general
methods for selecting properties have been given in a previous report, in Minsky's
review article, and in Section 6 below. The best method to use if the pattern is of
the type used in radio station recognition is probably a combination of band pass
filters and the operator sequences of Selfridge and Minsky. The operators might
be quite similar to those used in the radio station recognition experiments, except
that they would have running counts, averages, etc. The carrier signal or even the
AGC voltage could not itself be fed to the filter R because they do not repeat accurately
enough for successive disturbances. The signals fed to R' would be very low frequency
in order to make the desired signal repeatable. The U's would ideally be selected to
make the noises in the various channels have negative correlations where the signals
have positive correlations, and vice versa.

5.1 Radio station recognition

The problem of recognizing radio stations from their carrier curves has been
chosen as a convenient example to illustrate and test concepts and methods of pattern
recognition. Theories of the type given in the sections above should always be
supplemented by information obtained from study of the physical mechanism of the pattern
generator if the best pattern recognizer is desired. The principal thing desired is to
find a suitable set of properties to feed to the decision making part of the recognizer.
It is obvious that a property which varies a large amount between classes and which
varies a small amount between different samples of the same class is a very good
property. This is just saying that it is desirable that the ratio of interclass variance to
intraclass variance be large. But note that if the pattern generator is not fixed in time,
certain properties may be very useful as controls (see Section 2.7 above) even if they
do not vary at all from class to class. Note also that the numerical results of Section
3.1 above show that the statistical independence of the different properties is of great
importance, especially if the ratio of interclass variance to intraclass variance is not
large. A principle which is useful in designing a pattern recognizer is that the
performance cannot be degraded by adding "noisy" or useless properties, although, of course,
these properties may not help the performance either (13). This principle suggests that
the designer should be more concerned with not overlooking a mediocre property than
with weeding out poor properties.
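The ratio of interclass variance to intraclass variance for a single property can be
estimated from calibration samples as in the following Python sketch. The station names and
measurements are invented for illustration, the function name is hypothetical, and all
classes are weighted equally, which is an assumption rather than something the report
prescribes.

```python
from statistics import mean, pvariance
from typing import Dict, List


def variance_ratio(samples_by_class: Dict[str, List[float]]) -> float:
    """Variance of the class means divided by the mean within-class variance."""
    class_means = [mean(values) for values in samples_by_class.values()]
    interclass = pvariance(class_means)
    intraclass = mean(pvariance(values) for values in samples_by_class.values())
    # A zero intraclass variance (identical samples) would raise ZeroDivisionError.
    return interclass / intraclass


if __name__ == "__main__":
    # Hypothetical measurements of one property for two stations.
    good_property = {"station 1": [1.0, 1.1, 0.9], "station 2": [5.0, 5.2, 4.8]}
    poor_property = {"station 1": [1.0, 5.0, 3.0], "station 2": [2.0, 6.0, 4.0]}
    print("good property ratio:", variance_ratio(good_property))
    print("poor property ratio:", variance_ratio(poor_property))
```

A ratio computed this way ranks properties one at a time and says nothing about their
statistical independence, which the text above identifies as equally important.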

The problem of interest here is to determine the location of a sine wave radio
transmitter in the frequency range 6-20 Mc, by study of the received signal. If
propagation is by various combinations of reflections (or refractions) from different layers
of the ionosphere and reflections from the earth, and if the condition of the ionosphere
changes with time, the received signal will not be sinusoidal. Tests which have actually
been made have used the carriers of a discrete set of international broadcast stations
as the sine wave sources, and since the time variations caused by the ionosphere are
usually much slower than any modulation component, the carrier fading curve can easily
be isolated, say by the receiver's AGC time constant circuits.

Propagation between two points can usually take place over an infinity of possible
paths, but under normal conditions these paths can be grouped, one group corresponding
to a single hop involving the F2 layer, another involving two F2 hops, another
involving a combination of one E layer hop and one F2 layer hop, etc. Within the group there
are many paths quite similar in length representing the scattering or non-specular
reflection due to ionospheric inhomogeneities. Balser and Smith have recently studied
some of the statistical properties of a single mode (group of paths). Their work,
together with some older results, suggests the following model for the received signal:

$$G(t) = g_1(t) + g_2(t) + \ldots + g_k(t)$$

Each of the $g$'s can be statistically described by giving its first order probability density
and its autocorrelation function (spectral power density). The different $g$'s might
reasonably be expected to be statistically independent in the sense that if a particular
station is received for about a half hour the minute by minute fluctuations of parameters
describing the probability densities and autocorrelations about their means are
independent. The path lengths are very long in wavelengths and the different paths are
reflected from widely separated regions of the ionosphere. The $g(t)$ functions vary more
slowly than the $G(t)$ function, say up to 0.3 cps as compared to up to 10 cps or more.
This is because there can be fast beats between two $g$'s. A set of properties which
might be quite useful in radio station recognition might be the following:

$$x_1 = f_1, \quad x_4 = f_2, \quad \ldots, \quad x_{3k-2} = f_k$$

$$x_2 = v_1, \quad x_5 = v_2, \quad \ldots, \quad x_{3k-1} = v_k$$

$$x_3 = \delta_1, \quad x_6 = \delta_2, \quad \ldots, \quad x_{3k} = \delta_k$$

where $f_i$ is the center frequency of $g_i(t)$, $v_i$ is the variance of $g_i(t)$, and $\delta_i$ is the
width of the first lobe of the autocorrelation function of $g_i(t)$. Several more properties
can be obtained from each $g(t)$, for example the parameter "a" of Balser and Smith,
the parameter "m" of Nakagami, additional parameters describing the shape of the
autocorrelation function, etc. Note that the relative frequency range of $G(t)$ and therefore
the $f$'s (but not their differences) will depend on the local oscillator frequency if a linear
detector is used. If an envelope detector is used the frequencies will not depend on the
local oscillator but $k$ will usually be increased and the different $g$'s will not be
independent. Note also that there is a problem in how to label the different components in
the sequence $g_1(t), g_2(t) \ldots$ because if they are numbered by increasing frequency and if
the first component fades out all remaining properties will be permuted. If the
numbering is in order of decreasing amplitude ($v$) two nearly-equal components can switch
their labels, but this may not cause as much difficulty.
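Two of the properties named above, the variance and the width of the first lobe of the
autocorrelation function, can be estimated from a sampled component as in the following
Python sketch. The sketch assumes that a single component has already been isolated and
sampled at a uniform interval, and it takes the first-lobe width to be the lag at which the
normalized autocorrelation first falls to zero; that definition, the test waveform, and the
function names are assumptions made for illustration. Estimating the center frequency is
not attempted here.

```python
import math
from statistics import mean, pvariance
from typing import List, Optional, Tuple


def autocorrelation(samples: List[float], max_lag: int) -> List[float]:
    """Normalized autocorrelation estimate for lags 0..max_lag (1.0 at lag 0)."""
    mu = mean(samples)
    centered = [s - mu for s in samples]
    r0 = sum(c * c for c in centered)
    return [
        sum(centered[n] * centered[n + lag] for n in range(len(centered) - lag)) / r0
        for lag in range(max_lag + 1)
    ]


def first_lobe_width(acf: List[float], sample_interval: float) -> Optional[float]:
    """Lag in seconds at which the autocorrelation first falls to zero or below."""
    for lag in range(1, len(acf)):
        if acf[lag] <= 0.0:
            return lag * sample_interval
    return None  # no zero crossing within the lags examined


def component_properties(
    samples: List[float], sample_interval: float, max_lag: int
) -> Tuple[float, Optional[float]]:
    """Return the variance and the first-lobe width of one component."""
    acf = autocorrelation(samples, max_lag)
    return pvariance(samples), first_lobe_width(acf, sample_interval)


if __name__ == "__main__":
    dt = 0.1  # seconds between samples (hypothetical)
    # Stand-in for a slowly fading component: a 0.25 cps sinusoid.
    fading = [math.sin(2.0 * math.pi * 0.25 * n * dt) for n in range(400)]
    v, delta = component_properties(fading, dt, max_lag=50)
    print(f"variance = {v:.3f}, first-lobe width = {delta:.2f} s")
```

Because the properties are computed per component, the labeling problem described above
still has to be settled before the values can be placed in a fixed property vector.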

5.2  Radio  station  recognition  results 

The results of some more tests of the radio station recognition process are tabulated
in Table 7. These tests were designed to study the following effects:

1. The advantage of using a correlated Gaussian distribution compared to the
uncorrelated form. An increase in the percent of correct classification can usually be
noted if comparable tests are compared (Tests 10 and 11 differ only in the inclusion of
covariance terms). However, the use of a correlated Gaussian formula with a small sample
means the parameter estimation will be very poor, since there are so many covariance
terms. If a separate calibration sample is used, the percent of correct classification
is subject to large sampling errors; if classification is done on the same sample which
was used for calibration, the result tends to be much too optimistic (e.g. Test 6).

2. The effect of the number of properties used. It has been shown in a previous
report that including too many properties in a pattern recognizer cannot make the
expected percent of correct classification lower, although the extra properties may not
improve the classification either. From Test 10 to Test 9 there is a slight improvement.
As might be expected, the correlated tests show larger improvements, e.g. Tests 11
and 14 are comparable and show an improvement from 45% to 78% when the properties
are increased from 6 to 12.

3. Controls. The use of a control signal to partially compensate for the effects of
time-varying statistics has been previously discussed (Sections 2.6 and 2.7). Test 11
was made with 6 properties and no control and gave 45%, while Test 15 with the same 6
properties and with control gave 69%. There is some indication (Tests 14 and 15) that
controls are not as good as an equal number of additional properties.

4. The effect of recognizing on the calibration data is to increase the percent of
correct classification. Even if the data happened to be entirely random and independent
of the station, if only two patterns from each station are observed they would almost
certainly be classified correctly by a recognizer designed from the same two
patterns. If many patterns are used the results are not quite so deceptive, since a large
percent correct does indicate a strong clustering of the sample points. Table 7 indicates
a lowered percent correct when the recognition is on separate patterns, but the results
are still above those to be expected from a purely random classification. For example,
in Test 16 with 8 stations being recognized, if the decision were purely random the percent
correct would be expected to be 12.5%, not 34%. The large percent correct in
Test 7 may be explained by the fact that only 4 stations were involved.

5. Some other effects have been studied which do not appear in Table 7. A few
tests were made with 30 sec. patterns instead of 60 sec. patterns, and the percent
correct with 30 sec. patterns was not noticeably lower, but there is too little evidence
to come to any definite conclusion. There are theoretical reasons to believe that longer
samples will give better results, and this will be tested. Another interesting question
is whether the errors which were made would be repeated in the same way if the
same recognizer is used on successive 60 sec. patterns on the same day. One test indicates
that the error made on a certain day is repeated consistently more often than
not, and therefore designing the pattern recognizer to vary from day to day is important.
The effects of different properties and different controls (including sunspot numbers,
WWV propagation reports, and indices of magnetic activity) have been investigated and
will be reported on (23).
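The optimism that comes from recognizing on the calibration data (effect 4 above) can be illustrated with a small simulation. The following is only a sketch and is not the procedure used for Table 7: it assumes a simple nearest-mean recognizer, 8 stations with 2 calibration patterns each, and 12 properties drawn as pure noise, so that the patterns carry no information about the station. The function names and default values are illustrative assumptions. Under these assumptions the percent correct on the calibration patterns should come out far above chance, while fresh patterns should score close to the 1-in-8 chance level.

```python
import random


def nearest_mean_label(x, means):
    """Return the station whose mean pattern is closest to x (squared distance)."""
    return min(means, key=lambda s: sum((a - b) ** 2 for a, b in zip(x, means[s])))


def random_pattern(rng, n_properties):
    """Pure noise: the pattern is independent of the station (assumption)."""
    return [rng.gauss(0.0, 1.0) for _ in range(n_properties)]


def simulate(n_stations=8, per_station=2, n_properties=12, n_trials=200, seed=0):
    """Return (fraction correct on calibration data, fraction correct on fresh data)."""
    rng = random.Random(seed)
    same_hits = fresh_hits = total = 0
    for _ in range(n_trials):
        # Calibration sample: a few noise patterns labeled with each station.
        calib = {
            s: [random_pattern(rng, n_properties) for _ in range(per_station)]
            for s in range(n_stations)
        }
        # "Design" the recognizer: one mean vector per station.
        means = {
            s: [sum(col) / per_station for col in zip(*patterns)]
            for s, patterns in calib.items()
        }
        for s, patterns in calib.items():
            for x in patterns:
                # Recognition on the calibration pattern itself.
                same_hits += nearest_mean_label(x, means) == s
                # Recognition on a separate pattern given the same label.
                fresh = random_pattern(rng, n_properties)
                fresh_hits += nearest_mean_label(fresh, means) == s
                total += 1
    return same_hits / total, fresh_hits / total


if __name__ == "__main__":
    same, fresh = simulate()
    print("fraction correct on calibration patterns:", round(same, 3))
    print("fraction correct on separate patterns:   ", round(fresh, 3))
    print("chance level for 8 stations:              0.125")
```

Raising `per_station` in this sketch should narrow the gap between the two figures, in line with the remark that results on many calibration patterns are less deceptive.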

5.3  Apparatus

Three pieces of apparatus have been designed and built on this contract and in
connection with the previous contract: an analog-to-digital converter (18), a subsonic
spectrum analyzer and a stable RF source. A block diagram of the spectrum
analyzer is shown in Fig. 4. A photograph of the apparatus is shown in Fig. 5, and
some waveforms are shown in Figs. 6 and 7. The apparatus is essentially finished and all
blocks have been tested, but an overall test of the closed loop has not yet been made.

Fig. 5  Spectrum Analyzer, shown with A-D Converter and Radio Receiver. Labeled units:
Magnetic Tape Delay Unit; 10 cps Frequency Shifter and FM Modulator; Stepped Attenuator;
FM Demodulator and Input.

Fig. 6  Waveforms from Frequency Shifter. Panel I (input signal 2500 cycles) and Panel II
(input signal 1800 cycles) each show: 1. Input Signal; 2. Frequency Translator Output;
3. Lissajous figure, input versus input; 4. Lissajous figure, input versus output;
5. Algebraic sum of input + output. Vertical scale 0.5 V per major division; the input
signal is shown at 0.2 msec per major division and the algebraic sum at 0.1 sec per
major division.

Fig. 7  Waveforms from Delay Unit and Input Modulator. First panel: 1. Frequency
Translator Output (input 2500 cycles), 0.5 V and 0.1 msec per major division;
2, 3. FM Modulator Output, 20 V per major division; 4, 5. Readhead Output, 0.1 V per
major division; 6. FM Demodulator Output, 0.5 V and 0.1 msec per major division.
Second panel: 1. 10-cycle sine wave, 1 V and 0.2 sec per major division;
2. Input Modulator Output; 3. Frequency Translator Output; 4. Delay-line Output,
each at 0.5 V and 0.2 sec per major division.

6.1  General design procedure for pattern recognizers

No attempt has been made in the above sections or in the previous report (13) to
cover all methods of pattern recognition which have been suggested in the literature.

Most of the published attempts at pattern recognition have been specifically tailored to
one application and contribute little to a generalized theory of pattern recognition. This
is not to say that a very good pattern recognizer cannot be made by someone who knows
nothing of statistics and pattern recognition theory, but who knows much about the
particular pattern generating mechanism or who can intuitively see how humans are
performing the same recognition. Indeed, it would be foolish not to use all knowledge which is
available concerning the particular field in which the pattern recognition is to take place.
All special tricks which can be thought of should be employed. The complete theory
should desirably be general enough to include all special cases. However, there are
interesting areas where the more formal theory is the only guide: humans have no
experience in recognizing the pattern, a complete physical theory of the pattern generator
is impractical, and the pattern itself is too complex for simple enumerative techniques.

It is not felt that a complete formal theory of pattern recognition is yet at hand. The
following is meant to be a checklist of ideas which might serve as components in such a
theory. More details on each topic can be found in the references cited. The reviews
of Minsky, Hawkins, Sebestyen, Weisz et al. and the Bionics Symposium Report
should provide references to most of the work in pattern recognition up to 1961.


A.  Property selection methods

1. Decision tree: Morse code recognizer, property lists and characters,
more properties for difficult pairs of classes, sequential decisions.

2. Initial transformation: autocorrelation, pre-recognition cleanup,
decorrelation, segmentation of handwritten letters, normalization of size,
sampling of a time function, detection of RF signals, clustering.

3. Search and learning: pandemonium (45), operator sequences (34), n-grams
in language recognition, hill climbing.

4. Random selection (37), random networks (44), perceptron (54).

5. Coding theory: parity checks, group codes.

6. Invariants to distorting transformations: to groups of transformations,
topological invariants, transformation not a group.

7. Physical model of pattern generator as a guide: radio station recognition (17),
speech recognition, template matching.

8. Imitate biological processes, spatial computers (39, 40, 41), retinal
structure, search for distinctive features "by eye".

B.  Decision making in property space

1. Non-statistical methods: logical, linear boundaries (55), convex cones (27),
potential fields, distance comparisons, correlation, quadratic boundaries,
polynomial boundaries.

2. Choice of probability density function: uncorrelated Gaussian, correlated
Gaussian with equal covariance matrices, ditto with separate covariance matrix for
each class, discriminating functions, spherically symmetrical densities,
Poisson distributions.

3. Parameter estimation: classical methods, hypothesis testing, no explicit
estimation (see Section 2 of this report).

4. Non-stationary conditions (see Section 2 of this report), prediction (61, 62).

5. Miscellaneous topics: decision theory, game theory, context correction
after recognition, "no decision" regions, use of human link, adaptive filters (63, 64).

6. Testing: partition of a fixed sample between calibration and recognition,
misleading to recognize calibration points, is population really Gaussian.
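
Item B.2 of the checklist (choice of probability density function) can be made concrete with a short sketch. It counts how many quantities must be estimated per class for the uncorrelated and correlated Gaussian forms, and shows a decision made with the uncorrelated form. This is an illustration under stated assumptions rather than the recognizer used in the tests: the function names, the example class table and the use of log densities are all hypothetical choices.

```python
import math


def parameters_per_class(n_properties, correlated):
    """Number of quantities to estimate for one class with a Gaussian density."""
    means = n_properties
    variances = n_properties
    # Each distinct pair of properties adds one covariance term.
    covariances = n_properties * (n_properties - 1) // 2 if correlated else 0
    return means + variances + covariances


def log_uncorrelated_gaussian(x, means, variances):
    """Log density of an uncorrelated Gaussian: a sum over independent properties."""
    return sum(
        -0.5 * math.log(2.0 * math.pi * v) - (xi - m) ** 2 / (2.0 * v)
        for xi, m, v in zip(x, means, variances)
    )


def classify(x, classes):
    """classes maps label -> (prior, means, variances); choose the largest prior * density."""
    def score(label):
        prior, means, variances = classes[label]
        return math.log(prior) + log_uncorrelated_gaussian(x, means, variances)

    return max(classes, key=score)


if __name__ == "__main__":
    for d in (6, 12):
        print(d, "properties:",
              parameters_per_class(d, correlated=False), "uncorrelated,",
              parameters_per_class(d, correlated=True), "correlated")
    # Hypothetical two-class example with two properties.
    example = {"A": (0.5, [0.0, 0.0], [1.0, 1.0]), "B": (0.5, [3.0, 3.0], [1.0, 1.0])}
    print(classify([0.1, 0.0], example))
```

The count grows roughly with the square of the number of properties in the correlated case, which is why a small calibration sample estimates the covariance terms poorly.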
Appendix - Bayes' Rule


Bayes' rule can be applied to almost any situation where an unseen cause is to
be inferred from an observed effect. The joint probability P of a cause c (c = 1, 2, ..., n)
and effect e (e ∈ E, some probability space) can be written:

$$P(c, e) = p_c \, p(e \mid c)$$

where $p_c$ = a-priori probability of cause c

$p(e \mid c)$ = conditional probability of effect e given cause c

The conditional probability of the cause c given the effect e, q(c|e), is obtained by
dividing P by the probability of the effect e:

$$q(c \mid e) = \frac{P(c, e)}{\sum_{c'=1}^{n} P(c', e)} = \frac{p_c \, p(e \mid c)}{\sum_{c'=1}^{n} p_{c'} \, p(e \mid c')}$$

Bayes' rule is to select the most probable cause given the effect e, that is to maximize
the above expression as a function of c. This defines a function of e, say $\hat{c}(e)$, taking on
values 1, 2, ..., n. The space E is thereby divided into n regions $R_1, R_2, \ldots, R_n$ where
$e \in R_k$ if and only if $\hat{c}(e) = k$.

Note that since the denominator does not depend on c, only the numerator need
be considered in the maximization. (If $p_c$ is either unknown or a constant, then $p(e \mid c)$
will be maximized as a function of c, resulting in a "maximum likelihood" decision.)

The probability of making an incorrect decision by following Bayes' rule is given
by:

$$Q = 1 - \sum_{k=1}^{n} \int_{R_k} p_k \, p(e \mid k) \, de \qquad \text{(A2)}$$

$$Q = 1 - \int_{E} p_{\hat{c}(e)} \, p(e \mid \hat{c}(e)) \, de$$

Consideration of the definition of $R_k$ and $\hat{c}(e)$ shows these forms to be equivalent, and it
is obvious from the second that Q will be minimized by following Bayes' rule.
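
The decision rule and the error probability Q can be checked numerically on a small discrete example. The sketch below is illustrative only: it replaces the integral over E by a sum over three hypothetical effects, and the priors, conditional probabilities and function names are assumptions chosen for the example. It computes the Bayes decision for each effect, evaluates Q from the second form above, and compares it with Q for every other possible decision rule.

```python
import itertools

# Hypothetical example: two causes, three discrete effects.
PRIORS = {1: 0.6, 2: 0.4}                       # p_c
LIKELIHOODS = {                                 # p(e | c)
    1: {"a": 0.7, "b": 0.2, "c": 0.1},
    2: {"a": 0.1, "b": 0.4, "c": 0.5},
}
EFFECTS = ["a", "b", "c"]


def bayes_decision(e, priors=PRIORS, likelihoods=LIKELIHOODS):
    """Select the cause c that maximizes the numerator p_c * p(e | c)."""
    return max(priors, key=lambda c: priors[c] * likelihoods[c][e])


def error_probability(rule, priors=PRIORS, likelihoods=LIKELIHOODS):
    """Q = 1 - sum over e of p_{c(e)} * p(e | c(e)), where rule[e] is the chosen cause."""
    return 1.0 - sum(priors[rule[e]] * likelihoods[rule[e]][e] for e in rule)


def all_rules(priors=PRIORS, effects=EFFECTS):
    """Enumerate every possible assignment of a cause to each effect."""
    for choice in itertools.product(priors, repeat=len(effects)):
        yield dict(zip(effects, choice))


if __name__ == "__main__":
    bayes_rule = {e: bayes_decision(e) for e in EFFECTS}
    print("Bayes decision for each effect:", bayes_rule)
    print("Q for Bayes' rule:", error_probability(bayes_rule))
    print("smallest Q over all rules:", min(error_probability(r) for r in all_rules()))
```

Because Q is one minus a sum of terms chosen separately for each effect, picking the largest term for each effect cannot be improved upon, which is what the exhaustive comparison confirms for this example.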

Some very useful variants of Bayes' rule can easily be developed. Suppose there
are two causes $c_1$ and $c_2$, but that only $c_1$ is to be inferred. The joint probability of cause
and effect is now $P(c_1, c_2, e)$ while the probability of the desired cause given the effect
is

$$q(c_1 \mid e) = \frac{\sum_{c_2} P(c_1, c_2, e)}{\sum_{c_1'} \sum_{c_2} P(c_1', c_2, e)}$$

However, by using the fact that any probability formula remains valid if the same
condition is put in all probabilities:

$$p(e \mid c_1) = \sum_{c_2} p(c_2 \mid c_1) \, p(e \mid c_1, c_2)$$

the following is obtained:

$$q(c_1 \mid e) = \frac{p_{c_1} \, p(e \mid c_1)}{\sum_{c_1'} p_{c_1'} \, p(e \mid c_1')}$$

This is the same as completely ignoring the undesired cause in all formulas from the
beginning.

The above may be obvious, but it gives some formulas useful in the following
situation: There are two causes $c_1$ and $c_2$ and two effects $e_1$ and $e_2$. Effect $e_1$ is influenced
by both causes, but effect $e_2$ is influenced only by cause $c_2$ and is independent of cause
$c_1$. Only cause $c_1$ is desired. If the last paragraph is reread it will be found to be
perfectly valid if e is replaced by $e_1, e_2$, since no mention was made as to whether the
different parts of e might or might not be independent of $c_1$ or $c_2$. The last equation becomes

$$q(c_1 \mid e_1, e_2) = \frac{p_{c_1} \, p(e_1, e_2 \mid c_1)}{\sum_{c_1'} p_{c_1'} \, p(e_1, e_2 \mid c_1')}$$

The data $e_2$ might be called a "control" or "side information" since, although it does
not depend on the desired cause $c_1$, it does depend on $c_2$ which has an undesired
influence on $e_1$. Such side information should be included with the other effects. If this
situation is looked on from another angle, it may be supposed that $e_2$ should be used to
infer something about $c_2$, which could in turn modify $e_1$ and so improve the estimate of
$c_1$. This approach would not only be more complicated, but it could not be more
effective than the one given. Some other forms for the above equation can be obtained as
follows:

$$p(e_1, e_2 \mid c_1) = p(e_2 \mid c_1) \, p(e_1 \mid c_1, e_2)$$

but since $e_2$ does not depend on $c_1$:

$$p(e_2 \mid c_1) = p(e_2)$$

Combining these equations, the factor $p(e_2)$ cancels between numerator and denominator:

$$q(c_1 \mid e_1, e_2) = \frac{p_{c_1} \, p(e_1 \mid c_1, e_2)}{\sum_{c_1'} p_{c_1'} \, p(e_1 \mid c_1', e_2)}$$

The meaning of the variables appearing in the above, and the use to which the
formulas might be put, will perhaps be more clearly fixed in mind if a couple of examples
are given. Example 1: Consider speech recognition and let $c_1$ be the desire to utter a
certain phoneme, let $c_2$ be some physical characteristic of the speaker, say his weight.
Now a person's weight will not be influenced by which phoneme he tries to utter, but it
is well known that present speech recognizers work well on the person who calibrated them
and poorly on other persons. Perhaps the weight will at least give some clue as to the
general pitch of the speaker, and the recognizer may be able to do better knowing this
general pitch. Example 2: Suppose the usual Bayes' rule situation, a-priori probabilities
unknown, conditional probabilities known, is reversed. In a given cause-effect situation
the a-priori probabilities are known but the "black box" probability (of the outcome
given the input) is not known. Identify the variables as follows: $c_1$ the input to be guessed,
$c_2$ the particular "black box" in use, $e_1$ the observed outputs due to the input which is to
be guessed, and $e_2$ the observed outputs for a calibration run during which known inputs
are fed to the "black box". Since the calibration outcomes cannot depend on the
selection of the unknown input (which occurs later in time), and since the particular "black
box" in use is of no interest, the usual calibrated pattern recognizer fits this Bayes'
rule extension.